Method, device, and computer program for transmitting media content

ABSTRACT

A method for encapsulating encoded timed media data into at least a first and a second track belonging to one same group of tracks, said media data corresponding to one or more video sequences made up of full frames. The method includes, for at least the first or second track, providing descriptive information about the spatial relationship of a first spatial part of one frame encapsulated in the first track with a second spatial part of said frame encapsulated in the second track, wherein said descriptive information, shared by the tracks belonging to a same group of tracks, indicates whether the region covered by both the first and the second spatial parts forms a full frame or not.

CROSS REFERENCE TO RELATED APPLICATION

The present application is a continuation of U.S. patent application Ser. No. 17/255,374, filed on Dec. 22, 2020, which is a continuation of PCT Application No. PCT/EP2019/066334, filed on Jun. 20, 2019. This application claims the benefit under 35 U.S.C. § 119(a)-(d) of United Kingdom Patent Application No. 1810563.5, filed on Jun. 27, 2018, and titled “METHOD, DEVICE, AND COMPUTER PROGRAM FOR TRANSMITTING MEDIA CONTENT”. The above cited patent application is incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

The present invention relates to methods and devices for encapsulating and transmitting media data.

BACKGROUND OF THE INVENTION

The invention is related to encapsulating media content, e.g. according to the ISO Base Media File Format as defined by the MPEG standardization organization, to provide a flexible and extensible format that facilitates interchange, management, editing, and presentation of media content, and to improve its delivery, for example over an IP network such as the Internet using an adaptive HTTP streaming protocol.

The International Standard Organization Base Media File Format (ISOBMFF, ISO/IEC 14496-12) is a well-known flexible and extensible format that describes encoded timed media data bit-streams either for local storage or for transmission via a network or via another bit-stream delivery mechanism. An example of extensions is ISO/IEC 14496-15, which describes encapsulation tools for various NAL (Network Abstraction Layer) unit based video encoding formats. Examples of such encoding formats are AVC (Advanced Video Coding), SVC (Scalable Video Coding), HEVC (High Efficiency Video Coding), and L-HEVC (Layered HEVC). Another example of file format extensions is ISO/IEC 23008-12, which describes encapsulation tools for still images or sequences of still images such as HEVC Still Image. Another example of file format extensions is ISO/IEC 23090-2, which defines the omnidirectional media application format (OMAF). The ISO Base Media file format is object-oriented. It is composed of building blocks called boxes (data structures characterized by a four-character code) that are sequentially or hierarchically organized and that define parameters of the encoded timed media data bit-stream, such as timing and structure parameters. In the file format, the overall presentation is called a movie. The movie is described by a movie box (with the four-character code ‘moov’) at the top level of the media or presentation file. This movie box represents an initialization information container containing a set of various boxes describing the presentation. It is logically divided into tracks represented by track boxes (with the four-character code ‘trak’). Each track (uniquely identified by a track identifier (track_ID)) represents a timed sequence of media data belonging to the presentation (frames of video, for example). Within each track, each timed unit of data is called a sample; this might be a frame of video, audio, or timed metadata. Samples are implicitly numbered in sequence. The actual sample data are stored in boxes called Media Data Boxes (with the four-character code ‘mdat’) at the same level as the movie box. A description of the samples is stored in the metadata part of the file in a SampleTableBox. The movie can be organized temporally as a movie box containing information for the whole presentation followed by a list of movie fragment and Media Data box pairs. Within a movie fragment (box with the four-character code ‘moof’) there is a set of track fragments (boxes with the four-character code ‘traf’), zero or more per movie fragment. The track fragments in turn contain zero or more track run boxes (‘trun’), each of which documents a contiguous run of samples for that track fragment.
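
For purely illustrative purposes, the box structure described above can be inspected with a few lines of code. The following is a minimal sketch (not part of the invention; the file name and function name are arbitrary) that walks the top-level boxes of an ISOBMFF file by reading the 32-bit size and four-character type that start every box:

import struct

def list_top_level_boxes(path):
    # Every ISOBMFF box starts with a 32-bit big-endian size
    # followed by a four-character type code (e.g. 'moov', 'mdat').
    with open(path, 'rb') as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack('>I4s', header)
            print(box_type.decode('ascii', 'replace'), size)
            if size == 1:
                # A 64-bit 'largesize' field follows the box type.
                size = struct.unpack('>Q', f.read(8))[0]
                f.seek(size - 16, 1)  # skip the rest of the box payload
            elif size == 0:
                break  # the box extends to the end of the file
            else:
                f.seek(size - 8, 1)

list_top_level_boxes('movie.mp4')  # hypothetical input file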

An ISOBMFF file may contain multiple encoded timed media data bit-streams or sub-parts of encoded timed media data bit-streams forming multiple tracks. When sub-parts correspond to one or more successive spatial parts of a video source, taken over time (e.g. at least one rectangular region, sometimes called a ‘tile’, taken over time), the corresponding multiple tracks may be called sub-picture tracks. ISOBMFF and its extensions comprise several grouping mechanisms to group together tracks, static items, or samples. A group typically shares common semantics and/or characteristics.

For instance, ISOBMFF comprises an entity grouping mechanism, a track grouping mechanism, and a sample grouping mechanism. The entity grouping mechanism can be used to indicate that tracks and/or static items are grouped according to an indicated grouping type or semantics. The track grouping mechanism can be used to indicate that tracks are grouped according to an indicated grouping type or semantics. The sample grouping mechanism can be used to indicate that certain properties associated with an indicated grouping type or semantics apply to an indicated group of samples within a track. For example, sub-picture tracks from a same source may be grouped using the track group mechanism.

To improve the user experience, timed media data bit-streams (video and even audio) may be recorded in very high definition (e.g. 8k by 4k pixels or more). To improve the user experience, and in particular to offer an immersive experience, timed media data bit-streams (video and even audio) may be omnidirectional (or multi-directional or pluri-directional). When applied to video, also known as 360° panoramic video, the user has the feeling of being located in the scene that is displayed.

An omnidirectional video may be obtained from a 360° camera and/or by combining images of video streams obtained from several cameras, for example mounted on a special rig so that all the cameras have a common nodal point. Such a combination of images is known as image stitching or camera stitching.

Such an omnidirectional video may be rendered via head-mounted displays according to the user's viewing orientation, or through projection onto a curved screen surrounding the users. It may also be displayed on traditional 2D screens with a navigation user interface to pan within the omnidirectional video according to the part of the omnidirectional video desired by the user (also known as the viewport). It is often referred to as virtual reality (VR) since the user has the feeling of being in a virtual world. When virtual objects are added to the omnidirectional video, it is referred to as augmented reality (AR).

The inventors have noticed several problems when describing and signaling information about the media data to transmit, in particular when the media content is split into several sub-parts carried by multiple sub-picture tracks.

An example involves the signaling of sub-picture tracks requiring a specific parsing process from the client, which generates overhead and is complex.

Another example concerns the signaling of groups of tracks or sub-picture tracks, and in particular the possible association between these groups of tracks or sub-picture tracks.

Another example involves the signaling of the sub-picture tracks that are allowed, or not, to be combined to rebuild an omnidirectional media content ready for display. The existing solutions are either complex or not well defined, and not fully compliant with the existing mechanisms for the two-dimensional multi-track encapsulation process.

SUMMARY OF THE INVENTION

The present invention has been devised to address one or more of the foregoing concerns.

In this context, there is provided a solution for streaming media content (for example omnidirectional media content), for example over an IP network such as the Internet using the HTTP protocol.

According to a first aspect of the invention there is provided a method for encapsulating encoded timed media data into at least a first and a second track belonging to one same group of tracks, said media data corresponding to one or more video sequences made up of full frames,

the method comprising, for at least the first or second track:

- providing descriptive information about the spatial relationship of a first spatial part of one frame encapsulated in the first track with a second spatial part of said frame encapsulated in the second track,

wherein said descriptive information, shared by the tracks belonging to a same group of tracks, indicates whether the region covered by both the first and the second spatial parts forms a full frame or not.

In particular, each group shares a particular characteristic or the tracks within a group have a particular relationship.

In an embodiment, said descriptive information is provided in a same data structure comprising descriptive information shared by all the tracks of the group of tracks.

In an embodiment, the data structure is a TrackGroupTypeBox.

In an embodiment, said descriptive information comprises a parameter provided for all the tracks of the group of tracks, taking a first value when the region covered by the first and the second spatial parts is a full frame, and a second value when the region covered by the first and the second spatial parts is not a full frame.

In an embodiment, said descriptive information further comprises parameters for signaling the spatial parts missing from the full frame, when the region covered by the first and the second spatial parts is not the full frame.

According to a second aspect of the invention, there is provided a method for encapsulating encoded timed media data into a plurality of tracks belonging to at least a first or a second group of tracks of a same group type,

wherein the method comprises, for the tracks of the plurality of tracks belonging to the first group of tracks:

- providing descriptive information indicating that at least one track belonging to the first group of tracks and at least one track belonging to the second group of tracks are switchable.

In an embodiment, said descriptive information is shared by all the tracks belonging to the first group of tracks.

In an embodiment, said descriptive information is provided in a same data structure comprising descriptive information shared by all the tracks of the group of tracks.

In an embodiment, the data structure comprises identifiers for signaling the groups of tracks whose tracks are switchable with the tracks of the first group of tracks.

In an embodiment, said descriptive information is a dedicated data structure containing only one or more parameters signaling the groups of tracks whose tracks are switchable with the tracks of the first group of tracks.

According to a third aspect of the invention, there is provided a method for encapsulating encoded media data corresponding to a wide view of a scene, the method comprising:

- obtaining a projected picture from the wide view of the scene;
- packing the obtained projected picture into at least one packed picture;
- splitting the at least one packed picture into at least one sub-picture;
- encoding the at least one sub-picture into a plurality of tracks; and
- generating descriptive metadata associated with the encoded tracks,

wherein the descriptive metadata comprise an item of information, associated with each track, indicative of a spatial relationship between the at least one sub-picture encoded in the track and the at least one projected picture.

According to a fourth aspect of the invention, there is provided a method for generating a media file, comprising:

- capturing one or more video sequences made up of full frames,
- encoding media data corresponding to the frames of the one or more video sequences,
- encapsulating the encoded media data into at least a first and a second track belonging to one same group of tracks according to the encapsulating method of claim 1, and
- generating at least one media file comprising said first and second tracks.

According to a fifth aspect of the invention, there is provided a method for obtaining at least one frame from a media file comprising encoded timed media data encapsulated into at least a first and a second track belonging to one same group of tracks, said media data corresponding to one or more video sequences made up of full frames,

the method comprising:

- parsing information associated with the first and the second track,

wherein the parsed information comprises descriptive information about the spatial relationship of a first spatial part of one frame encapsulated in the first track with a second spatial part of said frame encapsulated in the second track, said descriptive information, shared by all the tracks of the group of tracks, indicating whether the region covered by both the first and the second spatial parts forms a full frame or not.

According to a sixth aspect of the invention, there is provided a method for generating a media file, comprising:

- encoding media data,
- encapsulating the encoded media data into a plurality of tracks belonging to at least a first or a second group of tracks, according to the encapsulating method of claim 8, and
- generating at least one media file comprising said first and second tracks.

According to a seventh aspect of the invention, there is provided a method for obtaining media data from a media file comprising encoded timed media data encapsulated into a plurality of tracks belonging to at least a first or a second group of tracks of a same group type,

the method comprising:

- parsing information associated with the first and the second track,

wherein the parsed information comprises descriptive information indicating that at least one track belonging to the first group of tracks and at least one track belonging to the second group of tracks are switchable.

According to an eighth aspect of the invention, there is provided a computing device for encapsulating encoded timed media data into at least a first and a second track belonging to one same group of tracks, said media data corresponding to one or more video sequences made up of full frames,

the computing device being configured, for at least the first or second track, for:

- providing descriptive information about the spatial relationship of a first spatial part of one frame encapsulated in the first track with a second spatial part of said frame encapsulated in the second track,

wherein said descriptive information, shared by the tracks belonging to a same group of tracks, indicates whether the region covered by both the first and the second spatial parts forms a full frame or not.

According to a ninth aspect of the invention, there is provided a computing device for encapsulating encoded timed media data into a plurality of tracks belonging to at least a first or a second group of tracks of a same group type,

the computing device being configured for:

- providing, for the tracks of the plurality of tracks belonging to the first group of tracks, descriptive information indicating that at least one track belonging to the first group of tracks and at least one track belonging to the second group of tracks are switchable.

According to a tenth aspect of the invention, there is provided a computing device for obtaining at least one frame from a media file comprising encoded timed media data encapsulated into at least a first and a second track belonging to one same group of tracks, said media data corresponding to one or more video sequences made up of full frames,

the computing device being configured for:

- parsing information associated with the first and the second track,

wherein the parsed information comprises descriptive information about the spatial relationship of a first spatial part of one frame encapsulated in the first track with a second spatial part of said frame encapsulated in the second track, said descriptive information, shared by all the tracks of the group of tracks, indicating whether the region covered by both the first and the second spatial parts forms a full frame or not.

According to an eleventh aspect of the invention, there is provided a computing device for obtaining media data from a media file comprising encoded timed media data encapsulated into a plurality of tracks belonging to at least a first or a second group of tracks of a same group type,

the computing device being configured for:

- parsing information associated with the first and the second track,

wherein the parsed information comprises descriptive information indicating that at least one track belonging to the first group of tracks and at least one track belonging to the second group of tracks are switchable.

According to a twelfth aspect of the invention, there is provided a computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing a method according to any one of claims 1 to 14, when loaded into and executed by the programmable apparatus.

According to a thirteenth aspect of the invention, there is provided a computer-readable storage medium storing instructions of a computer program for implementing a method according to any one of claims 1 to 14.

According to a fourteenth aspect of the invention, there is provided a computer program which upon execution causes the method of any one of claims 1 to 14 to be performed.

BRIEF DESCRIPTION OF THE DRAWINGS

Further advantages of the present invention will become apparent to those skilled in the art upon examination of the drawings and detailed description. It is intended that any additional advantages be incorporated herein.

Embodiments of the invention are described below, by way of example only, and with reference to the following drawings, in which:

FIGS. 1a and 1b illustrate examples of a data flow for capturing, processing, encapsulating, transmitting, and rendering an omnidirectional video from a server to a client;

FIGS. 2a and 2b represent block diagrams illustrating examples of encapsulation according to embodiments of the invention;

FIG. 3 is a schematic block diagram of a computing device for implementation of one or more embodiments of the invention;

FIG. 4a describes an example of sub-picture track encapsulation containing several track groups for 2D spatial relationship description;

FIG. 4b illustrates, according to a second aspect of the invention, an example of an alternative way to indicate that groups are equivalent groups;

FIG. 5 illustrates an example of another embodiment in which the indication of equivalent track groups is provided outside the track declaration, according to the second aspect of the invention;

FIG. 6 illustrates an example of use of the SpatialRelationship2DDescriptionBox and the source_id according to embodiments of the invention;

FIG. 7 illustrates the sub-picture encapsulation according to embodiments of the third aspect of the invention;

FIG. 8 illustrates the parsing process according to embodiments of the invention;

FIG. 9 illustrates a system according to embodiments of the present invention;

FIGS. 10a, 10b, 10c and 10d illustrate several examples of the overall process of projection, optional packing, and splitting into sub-picture tracks according to embodiments of the invention;

FIG. 11 illustrates an example of a relation between a set of sub-picture tracks and a source image, according to an embodiment of the first aspect of the invention;

FIG. 12 illustrates an example of track groups for 2D spatial relationships with additional information related to reconstruction, according to an embodiment of the first aspect of the invention;

FIG. 13, comprising FIGS. 13a and 13b, illustrates explicit reconstruction from alternative sets of sub-picture tracks, according to an embodiment of the second aspect of the invention; and

FIG. 14 illustrates the extractor resolution by the File/segment de-encapsulation means, for example with an ISOBMFF parser, according to an embodiment of the second aspect of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

FIG. 1a illustrates an example of a system 10 implementing a transmitting method. The system 10 allows media data (for example 2D images) to be streamed. The system comprises a server device 101 and a client device 170, said media data being transmitted from the server device 101 to the client device 170. As illustrated, the media data can be a video sequence 1011 captured by a camera system 100 and delivered to the client device 170, to be displayed on a 2D screen 175 (TV, tablet, smartphone, etc.) by a user, for example.

In a preferred embodiment, the images 1011 forming the video sequence are split by splitting means 1012 into spatial parts 1013 to be independently encoded by encoding means 140. Independently encoded means that one spatial part does not use any data from another spatial part as reference for differential or predictive encoding. For example, when the encoding means 140 is based on the HEVC (High Efficiency Video Coding) compression format, the spatial parts 1013 can be encoded as independent tiles. In an alternative embodiment, the spatial parts 1013 can be encoded as motion-constrained tiles. The encoding means provide as many bitstreams as spatial parts, or one bitstream with N independent sub-bitstreams (e.g. when HEVC is used for encoding independent tiles). Then, each provided bitstream or sub-bitstream is encapsulated by File/segment encapsulating means 150 into multiple sub-picture tracks 1014. The packaging or encapsulation format can be, for example, according to the ISO Base Media File Format and ISO/IEC 14496-15, as defined by the MPEG standardization organization. The resulting file or segment files can be an mp4 file or mp4 segments. During the encapsulation, audio streams may be added to the video bit-stream, as well as metadata tracks providing descriptive information (metadata) about the video sequence or about the added audio streams.
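
For illustration only, the splitting performed by the splitting means 1012 may be sketched as cutting each image into a regular grid, each resulting spatial part being a candidate for independent encoding and for encapsulation in its own sub-picture track. The grid dimensions and frame size below are assumptions, not values mandated by the invention:

import numpy as np

def split_frame(frame, rows, cols):
    # frame: H x W x C pixel array; returns a dict mapping a grid
    # position to the corresponding spatial part (future sub-picture).
    h, w = frame.shape[0] // rows, frame.shape[1] // cols
    return {(r, c): frame[r*h:(r+1)*h, c*w:(c+1)*w]
            for r in range(rows) for c in range(cols)}

# Example: split an 8K (7680x4320) frame into a 4x4 grid of 1920x1080 parts.
frame = np.zeros((4320, 7680, 3), dtype=np.uint8)
parts = split_frame(frame, rows=4, cols=4)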

The encapsulated file or segment files are then delivered to the client device 170 via delivery means 160, for example over an IP network like the Internet using the HTTP (HyperText Transfer Protocol) protocol, or on a removable digital medium such as, for example, a disk or a USB key. For the sake of illustration, the delivery means 160 implement adaptive streaming over HTTP such as DASH (Dynamic Adaptive Streaming over HTTP) from the MPEG standardization committee (“ISO/IEC 23009-1, Dynamic adaptive streaming over HTTP (DASH), Part 1: Media presentation description and segment formats”). The delivery means may comprise a streaming server 161 and a streaming client 162. The media presentation description may provide descriptions and URLs for media segments corresponding to the track encapsulating a video sequence comprising full images, to the sub-picture tracks only, or to both. The media presentation description may provide alternative groups of sub-picture tracks, each group allowing a different reconstruction level of the scene captured by the camera 110. Alternatives can be, for example, in terms of resolution, quality, or bitrate, or different splits (a coarse or fine grid associated with the splitting means 1012).

Upon reception by the streaming client 162, the encapsulated media file or media segments are parsed by File/segment de-encapsulating means 171 so as to extract one or more data streams. The extracted data stream(s) is/are decoded by decoding means 172. In the case of an ISOBMFF file or segments received by the File/segment de-encapsulating means 171, the parsing is typically handled by an mp4 reader or mp4 parser. From the descriptive metadata, the parser can extract encapsulated video bitstreams and/or video sub-bitstreams.

Next, the decoded images or sub-images of the video sequence provided by the decoding means 172 are optionally composed by rendering means 174 into resulting images for video rendering. The rendered video is then displayed on displaying means 175, such as a screen (user device).

It is to be noted that video rendering depends on several parameters, among which are the display size and the processing power of the client. The rendering may then consist in displaying only a subset of the parsed and decoded sub-picture tracks. This may be controlled by the rendering means 174 or directly through content selection by the streaming client 162.

It has been observed that transmission and rendering of several images of VHD (‘Very High Definition’) video streams may lead to a very high bitrate and very high resolution media data stream. Therefore, when taking into account the whole system, to avoid wasting bandwidth and to remain compliant with the processing capabilities of the client players, there is a need to optimize access to the media data.

Such a need is all the more important given that a media data stream may be used for specific applications. In particular, a media data stream can be used for displaying images with dedicated displays like an array of projectors. It can also be used to display a particular region of interest in the captured video 110.

FIG. 1b illustrates another example of a system 11 implementing a transmitting method. The system 11 allows omnidirectional media data to be streamed. As illustrated, this media has a video content acquired from a camera system 100 and delivered to a head-mounted display (HMD) 170 and 176. The camera system 100 may contain one camera with a wide-angle lens or a set of multiple cameras assembled together (for example a camera rig for virtual reality). The delivery means 160 may perform delivery, for example, over an IP network 163 such as the Internet using an adaptive HTTP streaming protocol, via the streaming server 161 and the streaming client 162.

For the sake of illustration, the camera system 100 used is based on a set of six standard cameras, associated with each face of a cube. It is used to capture images representing the real scene surrounding the camera system. According to this arrangement, one camera provides front images, one camera provides rear images, one camera provides left images, one camera provides right images, one camera provides bottom images, and one camera provides top images.

The images obtained from the camera system 100 are processed by image processing means in the server 101 to create 360° images forming an omnidirectional video stream, also called a 360 video stream or a virtual reality media data stream.

The processing means 120 allow stitching and projecting captured images of the same time instance. Images are first stitched and projected onto a three-dimensional projection structure representing a sphere 121, forming a 360° view in both horizontal and vertical dimensions. The 360 image data on the projection structure is further converted onto a two-dimensional projected image 122 (also denoted a capturing projection), for example using an equirectangular projection (https://en.wikipedia.org/wiki/Equirectangular_projection). The projected image covers the entire sphere.
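
The equirectangular projection mentioned above maps a direction on the sphere (given by a longitude and a latitude) linearly onto the pixel grid of the projected image. The following sketch shows this standard mapping; it is provided for illustration and is not a normative formulation:

def equirectangular_to_pixel(lon_deg, lat_deg, width, height):
    # Longitude in [-180, 180] maps linearly to x in [0, width];
    # latitude in [-90, 90] maps linearly to y in [0, height],
    # with y = 0 at the top (north pole) of the projected image.
    x = (lon_deg + 180.0) / 360.0 * width
    y = (90.0 - lat_deg) / 180.0 * height
    return x, y

# The direction straight ahead (0, 0) falls at the picture centre.
print(equirectangular_to_pixel(0.0, 0.0, 7680, 3840))  # (3840.0, 1920.0)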

Alternatively, if the omnidirectional media is a stereoscopic 360-degree video, the camera system 100 may be composed of multiple cameras capturing image sequences representing a left view and a right view that can be used later on by the client to render a three-dimensional 360-degree scene. In such a case, the processing means 120 described above process both the left-view and right-view image sequences separately. Optionally, frame packing may be applied by stereoscopic frame packing means 125, to pack each left view image and right view image of the same time instance onto a same projected image, resulting in one single left+right projected image sequence. Several stereoscopic frame packing arrangements are possible, for instance side-by-side, top-bottom, column-based interleaving, row-based interleaving, and temporal interleaving of alternating left and right views. Alternatively, a stereoscopic frame packing arrangement may also consist in keeping left and right views in separate and independent projected image sequences, resulting in independent video bit-streams after encoding by encoding means 140. For example, one video bit-stream represents the left view images and the other one the right view images.

Optionally, region-wise packing by region-wise packing means 130 is then applied to map the projected image 122 onto a packed image 131. Region-wise packing consists in applying transformations (e.g. rotation, mirroring, copy or move of pixel blocks, etc.), resizing, and relocating of regions of a projected image in order, for instance, to maximize signal information on the parts of the sphere most useful for the user. It can be noted that the packed image may cover only a part of the entire sphere. If region-wise packing is not applied, the packed image 131 is identical to the projected image 122. In the case of stereoscopic omnidirectional media, region-wise packing applies either on the left+right projected image sequence, or separately on the left-view and right-view projected image sequences, depending on the frame packing arrangement chosen by the stereoscopic frame-packing means 125.
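
The per-region operations of region-wise packing can be illustrated as follows. This sketch applies only a rotation (by multiples of 90°) and a nearest-neighbour resize to each region before copying it into the packed image; the region description is a hypothetical in-memory structure, not the OMAF region-wise packing syntax:

import numpy as np

def pack_regions(projected, regions, packed_shape):
    # regions: list of dicts giving the source rectangle in the projected
    # image (px, py, pw, ph), the destination rectangle in the packed
    # image (qx, qy, qw, qh) and an optional rotation in degrees.
    packed = np.zeros(packed_shape, dtype=projected.dtype)
    for r in regions:
        block = projected[r['py']:r['py']+r['ph'], r['px']:r['px']+r['pw']]
        block = np.rot90(block, k=r.get('rotation', 0) // 90)
        ys = np.arange(r['qh']) * block.shape[0] // r['qh']  # resample rows
        xs = np.arange(r['qw']) * block.shape[1] // r['qw']  # resample columns
        packed[r['qy']:r['qy']+r['qh'], r['qx']:r['qx']+r['qw']] = block[ys][:, xs]
    return packed

# Example: downscale one 200x100 projected region into a 100x50 packed region.
proj = np.zeros((200, 400, 3), dtype=np.uint8)
out = pack_regions(proj, [{'px': 0, 'py': 0, 'pw': 200, 'ph': 100,
                           'qx': 0, 'qy': 0, 'qw': 100, 'qh': 50}], (100, 200, 3))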

The projected images 122 or packed images 131 are encoded by the encoding means 140 into one or several video bit-streams. In the case of stereoscopic omnidirectional media, the encoding step applies either on the left+right packed image sequence, or separately on the left-view and right-view packed image sequences, depending on the frame packing arrangement chosen by the stereoscopic frame-packing means 125. Alternatively, Multi-View encoding can be used on the left-view and right-view packed image sequences.

Examples of encoding formats are AVC (Advanced Video Coding), SVC (Scalable Video Coding), HEVC (High Efficiency Video Coding), and L-HEVC (Layered HEVC). In the following, HEVC is used to refer both to HEVC and to its layered extensions (L-HEVC).

HEVC and similar video encoding formats define different spatial subdivisions of samples (e.g. pictures): tiles, slices, and slice segments. A tile defines a rectangular region of a picture that is delimited by horizontal and vertical boundaries (i.e., rows and columns) and that contains an integer number of Coding Tree Units (CTUs) or coding blocks, all referred to hereinafter as coding units. As such, tiles are good candidates to represent spatial sub-parts of a picture. However, coded video data (bit-stream) organization in terms of syntax and its encapsulation into NAL units (or NALUs) is rather based on slices and slice segments (as in AVC).

A slice in HEVC is a set of slice segments, with at least the first slice segment being an independent slice segment, the others, if any, being dependent slice segments. A slice segment contains an integer number of consecutive (in raster scan order) CTUs. A slice does not necessarily have a rectangular shape (it is thus less appropriate than tiles for spatial sub-part representation). A slice segment is encoded in the HEVC bit-stream as a slice_segment_header followed by slice_segment_data. Independent slice segments (ISS) and dependent slice segments (DSS) differ by their header: the dependent slice segment has a shorter header because it reuses information from the independent slice segment's header. Both independent and dependent slice segments contain a list of entry points in the bit-stream.

When a video bit-stream is encoded with tiles, tiles can be motion-constrained to ensure that tiles do not depend on neighboring tiles in the same picture (spatial dependency) or on neighboring tiles in previous reference pictures (temporal dependency). Thus, motion-constrained tiles are independently decodable.

Alternatively, the projected image 122 or packed image 131 can be split by splitting means into several spatial sub-pictures before encoding, each sub-picture being encoded independently, forming for instance an independent encoded HEVC bit-stream.

Alternatively, the region-wise packing means 130 and the splitting into several spatial sub-pictures by splitting means can operate simultaneously, without generating in memory the complete intermediate packed image 131. The projected image 122 (or the resulting stereoscopic projected image after the optional region-wise packing) can be split into sub-parts, and each sub-part can be directly packed into a spatial sub-picture to be encoded by the encoding means 140.

FIGS. 10a, 10b, 10c and 10d illustrate several examples of the overall process of projection, optional packing, and splitting into sub-picture tracks implemented in means 125, 130 or 1012, for example, according to embodiments of the invention. One or more regions of the projected picture 1001 (noted 1, 2, 3 and 4) are rearranged into packed regions 1002 (noted 1′, 2′, 3′ and 4′) by applying several transform operations (identity, up or down scaling, rotation, mirroring, relocation, etc.), and then split and reorganized into one or more sub-picture tracks 1003. The splitting may also lead to one sub-picture track per packed region (1′, 2′, 3′ or 4′). Packing and splitting operations may also be conducted at once, directly from the projected picture 1011 to one or more sub-picture tracks 1012. FIGS. 10c and 10d provide examples of different possible encapsulations in case the omnidirectional content is stereo content. In such a case, the capturing step 110 uses a camera rig allowing stereoscopic recording, typically one video per eye.

FIG. 10c depicts an example of stereoscopic omnidirectional content where there is no frame packing (means 125 for the optional frame packing in FIG. 1). Then, each projected view 1021 is independently encapsulated, possibly into multiple sub-picture tracks like 1023 when region-wise packing is applied to each view (in 1022). In this example, there is one sub-picture track per region of each view. One could even decide to encapsulate both views of a same region in the same sub-picture track. Then the sub-picture track would contain a stereo video box at sample description level indicating the frame packing used.

FIG. 10d depicts an example of stereoscopic omnidirectional content where frame packing (means 125 for the optional frame packing) is applied in order to pack the two projected views 1031 into a single frame-packed picture 1032. Then, the resulting frame-packed picture 1032 is encapsulated, possibly into multiple sub-picture tracks like in 1033. In this example, each sub-picture track describes both views for a given spatial region. As for projection followed by packing, one sub-picture track may encapsulate one region or many regions (as depicted in FIG. 10). An encapsulation module may decide on a description cost versus access granularity trade-off, for example to encapsulate the content into sub-picture tracks containing multiple packed regions. This may be the case when the encapsulation module, by computing the inverse projection of the packed regions, finds that there is no gap in the inverse projection of contiguous regions in the packed frame. This may be a decision criterion to group these regions from the packed picture into a single sub-picture track.

FIGS. 10a, 10b, 10c and 10d illustrate such gathering of several regions in a same sub-picture track. In case the encapsulation module gathers, in a sub-picture track, multiple regions that generate gaps, holes, or uncovered pixels in the projected picture, it may set the sub-picture track positions and sizes equal to the positions and sizes of the bounding box of these multiple regions.

Therefore, as a result of the encoding performed by the encoding means 140, the projected image 122 or packed image 131 can be represented by one or more independent encoded bit-streams, or by at least one encoded bit-stream composed of one or more independently encoded sub-bit-streams.

Those encoded bit-streams and sub-bit-streams are then encapsulated by the encapsulating means 150 in a file or in small temporal segment files 165 according to an encapsulation file format, for instance according to the ISO Base Media File Format and the Omnidirectional MediA Format (OMAF, ISO/IEC 23090-2) as defined by the MPEG standardization organization. The resulting file or segment files can be an mp4 file or mp4 segments. During the encapsulation, audio streams may be added to the video bit-stream, as well as metadata tracks providing information on the video or on the audio streams.

The encapsulated file or segment files are then delivered to the client 170 via a delivery mechanism 160, for example over the Internet using the HTTP (HyperText Transfer Protocol) protocol, or on a removable digital medium such as, for example, a disk. For the sake of illustration, the delivery 160 is performed using adaptive streaming over HTTP such as DASH (Dynamic Adaptive Streaming over HTTP) from the MPEG standardization committee (“ISO/IEC 23009-1, Dynamic adaptive streaming over HTTP (DASH), Part 1: Media presentation description and segment formats”).

This standard enables the association of a compact description of the media content of a media presentation with HTTP Uniform Resource Locations (URLs). Such an association is typically described in a file called a manifest file or a description file 164. In the context of DASH, this manifest file is an XML file, also called the MPD file (Media Presentation Description).

By receiving an MPD file, a client device 170 gets the description of each media content component. Accordingly, it is aware of the kind of media content components proposed in the media presentation and knows the HTTP URLs to be used for downloading, via the streaming client 162, the associated media segments 165 from the streaming server 161. Therefore, the client 170 can decide which media content components to download (via HTTP requests) and to play (i.e. to decode and to play after reception of the media segments).

It is to be noted that the client device can get only the media segments corresponding to a spatial part of the full packed images representing a wide view of the scene, depending on the user's viewport (i.e. the part of the spherical video that is currently displayed and viewed by the user). The wide view of the scene may represent the full view represented by the full packed image.

Upon reception, the encapsulated virtual reality media file or media segments are parsed by the means 171 so as to extract one or more data streams that is/are decoded by the decoding means 172. In the case of an ISOBMFF file or segments received by the means 171, the parsing is typically handled by an mp4 reader or mp4 parser that, from the descriptive metadata, can extract encapsulated video bit-streams and/or video sub-bit-streams.

Next, the packed images or packed sub-images provided to the means 173 by the decoding means 172 are optionally unpacked to obtain the projected images, which are then processed for video rendering (rendering means 174) and displayed (displaying means 175).

Alternatively, packed sub-images may be rearranged to compose intermediate full packed images before being unpacked into projected pictures.

It is to be noted that video rendering depends on several parameters, among which are the point of view of the user, the point of sight, and the projection(s) used to create the projected images. As illustrated, rendering the video comprises a step of re-projecting the decoded projected images onto a sphere. The images obtained from such a re-projection are displayed in the head-mounted display 176.

For handling stereoscopic views, the process described by reference to FIG. 1 may be duplicated or partially duplicated.

It has been observed that stitching several images of UHD (Ultra High Definition) video streams into panorama images of a virtual reality media data stream leads to a very high bitrate and very high resolution virtual reality media data stream. Therefore, from a system's perspective, and to avoid wasting bandwidth and to remain compliant with the processing capabilities of the client players, there is a need to optimize access to the virtual reality media data.

Such a need is all the more important given that a virtual reality media data stream may be used for purposes other than the one described by reference to FIG. 1. In particular, a virtual reality media data stream can be used for displaying 360° images with specific displays like a 360° array of projectors. It can also be used to display a particular field of view and/or to change the point of view, the field of view, and the point of sight.

According to particular embodiments, encoded bit-streams and sub-bit-streams resulting from the encoding of a packed image 131 (means 140 of FIG. 1) are encapsulated into a file or into small temporal segment files according to an encapsulation file format, for instance the ISO Base Media File Format (ISO/IEC 14496-12 and ISO/IEC 14496-15), the Omnidirectional MediA Format (OMAF) (ISO/IEC 23090-2), and associated specifications as defined by the MPEG standardization organization.

An encoded bit-stream (e.g. HEVC) and possibly its sub-bit-streams (e.g. tiled HEVC, MV-HEVC, scalable HEVC) can be encapsulated as one single track. Alternatively, multiple encoded bit-streams that are spatially related (i.e. that are sub-spatial parts of a projected image) can be encapsulated as several sub-picture tracks. Alternatively, an encoded bit-stream (e.g. tiled HEVC, MV-HEVC, scalable HEVC) comprising several sub-bit-streams (tiles, views, layers) can be encapsulated as multiple sub-picture tracks.

A sub-picture track is a track embedding data for a sub-part, typically a spatial part or rectangular region, of a picture or image. A sub-picture track may be related to other sub-picture tracks or to the track describing the full picture the sub-picture is extracted from. For example, a sub-picture track can be a tile track. It can be represented by an AVC track, an HEVC track, an HEVC tile track, or any compressed video bit-stream encapsulated as a sequence of samples.

A tile track is a sequence of timed video samples corresponding to a spatial part of an image or to a sub-picture of an image or picture. It can be, for example, a region of interest in an image or an arbitrary region in the image. The data corresponding to a tile track can come from a video bit-stream or from a sub-part of a video bit-stream. For example, a tile track can be an AVC or HEVC compliant bit-stream, or can be a sub-part of an AVC or HEVC or any encoded bit-stream, like for example HEVC tiles. In a preferred embodiment, a tile track is independently decodable (the encoder took care to remove motion prediction from other tiles by generating “motion-constrained” tiles). When a tile track corresponds to a video bit-stream encoded in HEVC with tiles, it can be encapsulated into an HEVC tile track denoted as an ‘hvt1’ track, as described in ISO/IEC 14496-15 4th edition. It can then refer to a tile base track to obtain parameter sets, i.e. the high-level information needed to set up the video decoder. It can also be encapsulated into an HEVC ‘hvc1’ or ‘hev1’ track. A tile track can be used for spatial composition of sub-pictures into a bigger image or picture.

A tile base track is a track, common to one or more tile tracks, that contains data or metadata shared among these one or more tracks. A tile base track may contain instructions to compose images from one or more tile tracks. Tile tracks may depend on a tile base track for complete decoding or rendering. When a tile base track derives from a video bit-stream encoded in HEVC with tiles, it is encapsulated into an HEVC track denoted as an ‘hvc2’ or ‘hev2’ track. In addition, it is referenced by the HEVC tile tracks via a track reference ‘tbas’, and it shall indicate the tile ordering using a ‘sabt’ track reference to the HEVC tile tracks, as described in ISO/IEC 14496-15 4th edition.

A composite track (also denoted a reference track) is a track that refers to other tracks to compose an image. One example of a composite track is, in the case of video tracks, a track composing sub-picture tracks into a bigger image. This can be done by a post-decoding operation, for example in a track deriving from video tracks that provides transformations and transformation parameters to compose the images from each video track into a bigger image. A composite track can also be a track with extractor NAL units providing instructions to extract NAL units from other video tracks or tile tracks, so as to form, before decoding, a bit-stream resulting from sub-bit-stream concatenation. A composite track can also be a track that implicitly provides composition instructions, for example through track references to other tracks. A composite track may help the rendering performed by the rendering means 174 for spatial composition of sub-picture tracks, by providing bitstream concatenation or sample reconstruction rules. The bitstream concatenation or sample reconstruction rules may be defined for each sample, for example using one or more extractor NAL units, or they may be defined at track level, for example via track references as in a tile base track.

ISO/IEC 14496-12 provides a box denoted ‘trgr’, located at track level (i.e. within the ‘trak’ box in the ISOBMFF box hierarchy), to describe groups of tracks, where each group shares a particular characteristic or where the tracks within a group have a particular relationship. This track group box is an empty container defined as follows:

Box Type: ‘trgr’
Container: TrackBox (‘trak’)
Mandatory: No
Quantity: Zero or one

aligned(8) class TrackGroupBox extends Box(‘trgr’) {
}

This track group box can contain a set of track group type boxes defined as follows:

aligned(8) class TrackGroupTypeBox(unsigned int(32) track_group_type)
    extends FullBox(track_group_type, version = 0, flags = 0) {
  unsigned int(32) track_group_id;
  // the remaining data may be specified for a particular track_group_type
}

The particular characteristic or the relationship declared by an instance of this track group type box is indicated by the box type (track_group_type). This box also includes an identifier (track_group_id), which can be used to determine the tracks belonging to the same track group. All the tracks having a track group box with a track group type box having the same track_group_type and track_group_id values are part of the same track group. The box also allows the declaration of specific parameters associated with the track for a particular track group type.
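
In other words, a reader can reconstitute the track groups by collecting, for each track, the (track_group_type, track_group_id) pairs found in its ‘trgr’ box. A minimal sketch of this rule, using a hypothetical in-memory representation of the already parsed boxes:

from collections import defaultdict

def build_track_groups(tracks):
    # tracks: mapping track_ID -> list of (track_group_type,
    # track_group_id) pairs parsed from the TrackGroupTypeBoxes.
    groups = defaultdict(list)
    for track_id, group_boxes in tracks.items():
        for group_type, group_id in group_boxes:
            groups[(group_type, group_id)].append(track_id)
    return groups

# Four sub-picture tracks declaring the same '2dcc' track group:
tracks = {1: [('2dcc', 10)], 2: [('2dcc', 10)],
          3: [('2dcc', 10)], 4: [('2dcc', 10)]}
print(build_track_groups(tracks))  # {('2dcc', 10): [1, 2, 3, 4]}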

The MPEG ISOBMFF standard (ISO/IEC 14496-12 7th edition Amendment 1, May 2018) proposes a specific track group, SpatialRelationship2DDescriptionBox, for two-dimensional spatial relationships, as a TrackGroupTypeBox of type ‘2dcc’.

A SpatialRelationship2DDescriptionBox TrackGroupTypeBox with track_group_type equal to ‘2dcc’ indicates that this track belongs to a group of tracks with 2D spatial relationships (e.g. corresponding to planar spatial parts of a video source). A SpatialRelationship2DDescriptionBox TrackGroupTypeBox with a given track_group_id implicitly defines a coordinate system with an arbitrary origin (0, 0) and a maximum size defined by total_width and total_height; the x-axis is oriented from left to right and the y-axis from top to bottom. The tracks that have the same value of source_id within a SpatialRelationship2DDescriptionBox TrackGroupTypeBox are mapped as originating from the same source, and their associated coordinate systems share the same origin (0, 0) and the orientation of their axes. When only one track group for 2D spatial relationships is present in a file, the source_id parameter is optional. A source or video source corresponds to the content being captured by a camera or, for omnidirectional content, a set of cameras. For example, a very high-resolution video could have been split into sub-picture tracks. Each sub-picture track then conveys its position and sizes in the source video.

The two-dimensional spatial relationship track group of type ‘2dcc’ is defined as follows:

aligned(8) class SpatialRelationship2DSourceBox
    extends FullBox(‘2dsr’, 0, 0) {
  unsigned int(32) total_width;
  unsigned int(32) total_height;
  unsigned int(32) source_id;
}

aligned(8) class SubPictureRegionBox extends FullBox(‘sprg’, 0, 0) {
  unsigned int(16) object_x;
  unsigned int(16) object_y;
  unsigned int(16) object_width;
  unsigned int(16) object_height;
}

aligned(8) class SpatialRelationship2DDescriptionBox
    extends TrackGroupTypeBox(‘2dcc’) {
  // track_group_id is inherited from TrackGroupTypeBox;
  SpatialRelationship2DSourceBox( );  // mandatory, must be first
  SubPictureRegionBox( );             // optional
}

where

- object_x specifies the horizontal position of the top-left corner of the track within the region specified by the enclosing track group. The position value is the value prior to applying the implicit resampling caused by the track width and height, if any, in the range of 0 to total_width − 1, inclusive, where total_width is defined by the enclosing track group;
- object_y specifies the vertical position of the top-left corner of the track within the region specified by the enclosing track group. The position value is the value prior to applying the implicit resampling caused by the track width and height, if any, in the range of 0 to total_height − 1, inclusive, where total_height is defined by the enclosing track group;
- object_width specifies the width of the track within the region specified by the enclosing track group. The value is the value prior to applying the implicit resampling caused by the track width and height, if any, in the range of 1 to total_width, inclusive, where total_width is defined by the enclosing track group;
- object_height specifies the height of the track within the region specified by the enclosing track group. The value is the value prior to applying the implicit resampling caused by the track width and height, if any, in the range of 1 to total_height, inclusive, where total_height is defined by the enclosing track group;
- total_width specifies, in pixel units, the maximum width in the coordinate system of the ‘2dcc’ track group. The value of total_width shall be the same in all instances of SpatialRelationship2DDescriptionBox with the same value of track_group_id;
- total_height specifies, in pixel units, the maximum height in the coordinate system of the ‘2dcc’ track group. The value of total_height shall be the same in all instances of SpatialRelationship2DDescriptionBox with the same value of track_group_id; and
- source_id is an optional parameter providing a unique identifier for the source. It implicitly defines a coordinate system associated with this source.

SubPictureRegionBox( ) is an optional box providing the static positions and sizes of the track within the region specified by the enclosing track group.

If SubPictureRegionBox( ) is present in the SpatialRelationship2DDescriptionBox, then there shall be no associated SpatialRelationship2DGroupEntry in the associated track (this track has a constant, static size and position).

If SubPictureRegionBox( ) is not present in the SpatialRelationship2DDescriptionBox, then there shall be one or more associated SpatialRelationship2DGroupEntry(s) in the associated track (this track possibly has a dynamic size and/or position).

The SpatialRelationship2DGroupEntry( ) defining the ‘2dcc’ sample grouping allows declaring the positions and sizes of the samples from a sub-picture track in a two-dimensional spatial relationship track group. Version 1 of the SampleToGroupBox shall be used when grouping_type is equal to ‘2dcc’. The value of grouping_type_parameter shall be equal to the track_group_id of the corresponding spatial relationship track group.

The SpatialRelationship2DGroupEntry( ) is defined as follows:

class SpatialRelationship2DGroupEntry( ) extends VisualSampleGroupEntry(‘2dcc’) {
  unsigned int(16) object_x;
  unsigned int(16) object_y;
  unsigned int(16) object_width;
  unsigned int(16) object_height;
}

where

- object_x specifies the horizontal position of the top-left corner of the samples in this group within the coordinate system specified by the corresponding spatial relationship track group. The position value is the value prior to applying the implicit resampling caused by the track width and height, if any, in the range of 0 to total_width − 1, inclusive, where total_width is included in the corresponding SpatialRelationship2DDescriptionBox;
- object_y specifies the vertical position of the top-left corner of the samples in this group within the coordinate system specified by the corresponding spatial relationship track group. The position value is the value prior to applying the implicit resampling caused by the track width and height, if any, in the range of 0 to total_height − 1, inclusive, where total_height is included in the corresponding SpatialRelationship2DDescriptionBox;
- object_width specifies the width of the samples in this group within the coordinate system specified by the corresponding spatial relationship track group. The value is the value prior to applying the implicit resampling caused by the track width and height, if any, in the range of 1 to total_width, inclusive; and
- object_height specifies the height of the samples in this group within the coordinate system specified by the corresponding spatial relationship track group. The value is the value prior to applying the implicit resampling caused by the track width and height, if any, in the range of 1 to total_height, inclusive.

The samples of each track in a ‘2dcc’ track group can be spatially composed with samples (at the same composition or decoding time) from other tracks in this same group to produce a bigger image.
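
As an illustration of this spatial composition, the sketch below pastes the decoded sample of each sub-picture track at the object_x/object_y position declared for that track (through a static SubPictureRegionBox or a ‘2dcc’ sample group) inside a canvas of total_width by total_height. The decoded pixel data and the parsed position values are assumed to be available; the example values correspond to a hypothetical 3840x2160 source split into four 1920x1080 sub-pictures:

import numpy as np

def compose(total_height, total_width, sub_pictures):
    # sub_pictures: list of (decoded_sample, object_x, object_y) tuples,
    # the positions coming from the SubPictureRegionBox or the '2dcc'
    # sample group of each sub-picture track.
    canvas = np.zeros((total_height, total_width, 3), dtype=np.uint8)
    for sample, x, y in sub_pictures:
        h, w = sample.shape[:2]
        canvas[y:y+h, x:x+w] = sample
    return canvas

tile = np.zeros((1080, 1920, 3), dtype=np.uint8)
full = compose(2160, 3840, [(tile, 0, 0), (tile, 1920, 0),
                            (tile, 0, 1080), (tile, 1920, 1080)])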

Depending on the encoded bit-streams and sub-bit-streams resulting from the encoding of a packed image 131 (step 140 of FIG. 1), several variants of encapsulation in file format are possible.

FIGS. 2a and 2b represent block diagrams illustrating examples of file/segment encapsulation (implemented in means 150 of FIG. 1) according to an embodiment of the invention.

FIG. 2a illustrates steps for encapsulating (by means 150) 2D video into multiple tracks. At step 2200, the server determines whether the input bitstream(s), after encoding, are to be encapsulated as a single track or as multiple tracks. If single-track encapsulation applies, the video is encapsulated as a single track, optionally with a NAL unit mapping indicating which NAL units correspond to which region. If multiple tracks have to be generated (test 2200 ‘true’), for example when a split is performed by means 1122 in FIG. 1a, then in step 2220 the content creator of the files may add a composite track. A composite track allows providing an entry point or a “main” or “default” track for parsers or players. For example, the composite track has the flags values set in the track header indicating that it is enabled and that it is used in the movie, and optionally as preview. The tracks referenced by the composite track may not have these flags values set (except the track_enabled flags value), to hide these tracks from selection by clients, players, or users. When there is no composite track, each bitstream or sub-bitstream after encoding is encapsulated in its own track in the media file in step 2230. An optional step may consist in reducing the number of tracks by gathering bitstreams or sub-bitstreams to form bigger regions than the original split ones. When the encapsulation provides a composite track (test 2220 is ‘true’), two options are possible for a sample reconstruction rule: implicit or explicit reconstruction indication in the media file.

For implicit reconstruction (test 2240 is ‘true’, branch ‘yes’), the composite track is provided as a tile base track for tile tracks (e.g. tracks with an ‘hvt1’ sample entry), as defined by ISO/IEC 14496-15, in step 2241. Then each sub-picture track is encapsulated as a tile track depending on this tile base track, in step 2243, as specified in ISO/IEC 14496-15. Note that in addition to the ‘trif’ descriptor for tile tracks, each tile track may also be declared as part of a same track group for 2D spatial relationship description.

If the composite track is provided as a track with extractors for explicit reconstruction (test 2240 is ‘false’, branch ‘no’), an additional track is created in the media file in step 2242. This created track references each sub-picture track, for example with a ‘scal’ track reference type. If no composite track is provided (test 2220 is ‘false’, branch ‘no’), then the video part of the media is encapsulated as sub-picture tracks in a step 2230. Note that even if a composite track is present, the sub-picture tracks may also be grouped via the track group mechanism.
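The following hedged sketch, in Python, summarizes this decision flow of FIG. 2a (steps 2200 to 2243). The track records are hypothetical dictionaries used only for illustration; nothing here is mandated by the file format.

# Sketch of the FIG. 2a encapsulation decision flow.
def encapsulate_2d_video(sub_bitstreams, multiple_tracks, use_composite,
                         implicit_reconstruction):
    tracks = []
    if not multiple_tracks:                        # test 2200 'false'
        tracks.append({"type": "single", "data": sub_bitstreams})
        return tracks
    if use_composite:                              # test 2220 'true'
        if implicit_reconstruction:                # test 2240 'true'
            tracks.append({"type": "tile_base", "sample_entry": "hvt1"})  # 2241
            for sb in sub_bitstreams:              # step 2243
                tracks.append({"type": "tile", "depends_on": "tile_base",
                               "data": sb})
        else:                                      # step 2242
            tracks.append({"type": "extractor", "tref": "scal"})
            for sb in sub_bitstreams:
                tracks.append({"type": "sub_picture", "data": sb})
    else:                                          # step 2230
        for sb in sub_bitstreams:
            tracks.append({"type": "sub_picture", "data": sb})
    return tracks

tracks = encapsulate_2d_video([b"tile1", b"tile2"], multiple_tracks=True,
                              use_composite=True, implicit_reconstruction=True)
assert tracks[0]["type"] == "tile_base"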

Finally, the description for spatial composition and the relationship between the sub-picture tracks is generated at step 2250. A track group box for 2D spatial relationship description is added to each sub-picture track to describe the relative positions and sizes of each sub-picture track within the original video source.

According to an embodiment of the invention, additional spatial information may be provided. This additional information may be additional signaling as described in more detail by reference to FIGS. 12 and 13.

The additional information will allow the media parsers or media players to reconstruct the video to display (displaying means in FIGS. 1a and 1b).

In an alternative, if no additional information is provided in step 2250, the parser may infer the information from other data in the bitstream.

FIG. 2b: at step 200, the server determines if there are several spatially-related video bit-streams (i.e. representing spatial sub-parts of packed images and for which a spatial composition may create a bigger image) or if there are video bit-streams comprising video sub-bit-streams representing either motion-constrained tiles or multiple views that can be exposed to the client as multiple sub-picture tracks. If the encoded packed image cannot be exposed as multiple tracks because it is encoded as a single video bit-stream, or if the content creator does not wish to expose the encoded packed image as multiple tracks, then the video bit-stream or video sub-bit-streams are encapsulated into one single track (step 210). Otherwise, it is determined at step 220 if the media content to be encapsulated is composed of video sub-bit-streams representing motion-constrained tiles. If yes, at least one composite track may need to be provided to represent at least one composition of several tile tracks. The composition may represent the full packed images or only a sub-part of the full packed images. Using a composite track with tile tracks avoids requiring separate rendering and decoding of streams on the client side. The number of possible combinations to be exposed to the client depends on the content creator's choices. For instance, the content creator may want to combine tiles with different visual qualities depending on the current user's viewport. For this, it can encode a packed image several times with different visual qualities and propose several composite tracks representing the full packed image comprising different combinations of tiles in terms of visual qualities. By combining tiles at different qualities depending on the user's viewport, the content creator can reduce the consumption of network resources.

If at step 220 it is determined that composite tracks must be provided, it is then determined if implicit reconstruction can be used or not for the composite track (step 240).

Implicit reconstruction refers to bit-stream reconstruction from tile base and tile tracks, for instance as defined in ISO/IEC 14496-15 4th edition. Rather than using in-stream structures such as extractors to re-build samples of a composite track from samples of tile tracks, by replacing extractors in the composite track's samples by the data they reference in the tile tracks' samples, implicit reconstruction allows re-building the composite track's samples by concatenating samples of the composite track and tile tracks in the order of track references (e.g. ‘sabt’ track references in HEVC implicit reconstruction).
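A minimal sketch, in Python, of this concatenation is shown below, assuming a hypothetical data model where the co-timed samples of each tile track are already available as byte strings; the actual NAL unit framing is omitted.

# Implicit reconstruction: rebuild one composite (tile base) track sample by
# concatenating its own data with the co-timed tile track samples, in the
# order given by the 'sabt' track references.
def reconstruct_sample(base_sample: bytes, sabt_refs, tile_samples) -> bytes:
    """tile_samples maps track_ID -> sample bytes at the same decoding time."""
    out = bytearray(base_sample)      # e.g. parameter sets / common headers
    for track_id in sabt_refs:        # 'sabt' reference order
        out += tile_samples[track_id]
    return bytes(out)

sample = reconstruct_sample(b"\x00base", [2, 3], {2: b"tileA", 3: b"tileB"})
assert sample == b"\x00basetileAtileB"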

The use of implicit reconstruction depends on the scenario of use. When the composition of several tile tracks requires a rearrangement of the tiles at decoding time compared to the order of the tiles at encoding time, then some slice addresses must be rewritten. In such a case, implicit reconstruction is not possible and explicit reconstruction with extractors must be selected.

If implicit reconstruction is possible, a tile base track is generated (step 241), and the video sub-bit-streams are encapsulated as tile tracks that are not independently decodable (e.g. as HEVC ‘hvt1’ tracks).

Otherwise, an extractor track is generated (step 242), and the video sub-bit-streams are encapsulated as tile tracks that are independently decodable (e.g. as HEVC ‘hvc1’ or ‘hev1’ tracks).

Going back to step 220, if the media content does not contain tile sub-bit-streams or the content creator does not want to create and expose composite tracks, then the spatially-related video bit-streams or video sub-bit-streams (e.g. tiles or multiple views) are encapsulated into separate sub-picture tracks (step 230). In this particular case, if the tile sub-bit-streams are HEVC tiles, they are encapsulated as HEVC ‘hvc1’ or ‘hev1’ tracks.

At step 250, signaling for spatial composition is added to group together spatially-related video bit-streams or video sub-bit-streams. Spatial composition signaling can be provided by defining a specific TrackGroupTypeBox in each track (sub-picture tracks, tile tracks, composite tracks) that composes the group, for instance a track group of type ‘2dcc’ with the same track_group_id for all tracks pertaining to the same group, as defined in MPEG ISOBMFF (ISO/IEC 14496-12 7th edition Amendment 1) as previously described.

This track group box ‘2dcc’ would provide the relative two-dimensional coordinates of the track within the composition and the overall size of the image formed by the composition. The composition may represent entire packed images or only a sub-part of packed images. For instance, the content creator may want to expose multiple composite tracks allowing the building of the entire packed images or of only a sub-part of packed images.

Alternatively, the composition may represent entire projected images or only a sub-part of projected images.

Parameters from the ‘2dcc’ track group (track_group_id, source_id, total_width, total_height, object_x, object_y, object_width, object_height) directly match the parameters of the DASH Spatial Relationship Description (SRD) descriptor (defined in ISO/IEC 23009-1 3rd edition) that can be used in a DASH manifest to describe the spatial relationship of Adaptation Sets representing those tracks:

- track_group_id would match the DASH SRD spatial_set_id parameter,
- source_id would match the DASH SRD source_id parameter (when not present, the default value “1” may be used, since this parameter is mandatory in DASH SRD),
- object_x, object_y, object_width, object_height would match the DASH SRD object_x, object_y, object_width, object_height parameters respectively, and
- total_width and total_height from the associated track group (via the track_group_id) would match the DASH SRD total_width, total_height parameters.
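As an illustration of this mapping, the following Python sketch derives the comma-separated SRD value string of ISO/IEC 23009-1 (“source_id, object_x, object_y, object_width, object_height, total_width, total_height, spatial_set_id”) from the ‘2dcc’ parameters. The dictionary-based track group representation is an assumption for the example.

# Derive a DASH SRD descriptor value from '2dcc' track group parameters.
def srd_value(track_group):
    source_id = track_group.get("source_id", 1)   # default "1" when absent
    return "{},{},{},{},{},{},{},{}".format(
        source_id,
        track_group["object_x"], track_group["object_y"],
        track_group["object_width"], track_group["object_height"],
        track_group["total_width"], track_group["total_height"],
        track_group["track_group_id"])            # maps to spatial_set_id

group = {"track_group_id": 10, "object_x": 0, "object_y": 0,
         "object_width": 960, "object_height": 540,
         "total_width": 1920, "total_height": 1080}
assert srd_value(group) == "1,0,0,960,540,1920,1080,10"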

As an alternative, in case there is a composite track, spatial composition signaling can be provided implicitly by this composite track. Indeed, in case the composite track is a tile base track, the tile base track refers to a set of tile tracks via a track reference of type ‘sabt’. This tile base track and set of tile tracks forms a composition group. Similarly, if the composite track is an extractor track, the extractor track refers to a set of tile tracks via a track reference of type ‘scal’. This extractor track and set of tile tracks also forms a composition group. In both cases, the relative two-dimensional coordinates of each tile track within the composition can be provided by defining a sample grouping or default sample grouping of type ‘trif’ as defined in ISO/IEC 14496-15 4th edition.

As another alternative, spatial composition signaling can be provided by defining a new entity group. An entity group is a grouping of items or tracks. Entity groups are indicated in a GroupsListBox in a MetaBox. Entity groups referring to tracks may be specified in the GroupsListBox of a file-level MetaBox or in the GroupsListBox of a movie-level MetaBox. The GroupsListBox (‘grpl’) contains a set of full boxes, each called an EntityToGroupBox, with an associated four-character code denoting a defined grouping type. The EntityToGroupBox is defined as follows:

aligned(8) class EntityToGroupBox(grouping_type, version, flags)
    extends FullBox(grouping_type, version, flags) {
  unsigned int(32) group_id;
  unsigned int(32) num_entities_in_group;
  for(i=0; i<num_entities_in_group; i++)
    unsigned int(32) entity_id;
  // the remaining data may be specified for a particular grouping_type
}

Typically, group_id provides the id of the group and the set of entity_id provides the track_IDs of the tracks that pertain to the entity group. Following the set of entity_id, it is possible to extend the definition of the EntityToGroupBox by defining additional data for a particular grouping_type. According to an embodiment, a new EntityToGroupBox with, for instance, grouping_type equal to ‘egco’ (for Entity Group Composition) can be defined to describe the composition of two-dimensional spatially-related video bit-streams or video sub-bit-streams. The set of entity_id would contain the set of track_IDs of the tracks (sub-pictures, tile tracks, composite tracks) that compose a group. The overall size of the image formed by the composition can be provided as part of additional data associated to this new grouping_type ‘egco’.

EntityToGroupBox(‘egco’) would be defined as follows:

aligned(8) class EntityToGroupBox(‘egco’, version, flags)
    extends FullBox(‘egco’, version, flags) {
  unsigned int(32) group_id;
  unsigned int(32) num_entities_in_group;
  for(i=0; i<num_entities_in_group; i++)
    unsigned int(32) entity_id;
  unsigned int(16) total_width;
  unsigned int(16) total_height;
  unsigned int(32) source_id;
}

where total_width and total_height provide the size of the composition and the optional source_id parameter provides a unique identifier for the source and implicitly defines a coordinate system (i.e., an origin (0, 0) and the orientation of the axes) associated to the source.

Compared with DASH, group_id would match the DASH SRD spatial_set_id parameter, source_id would match the DASH SRD source_id parameter, and total_width and total_height would match the DASH SRD total_width and total_height parameters, respectively. When source_id is not present in the EntityToGroupBox for composition, the default value “1” is used to map to a DASH MPD. In case the MPD describes multiple media contents, it is then up to the MPD generator to handle and allocate source_id values that allow distinguishing one media content from another media content.
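For illustration, a hedged Python sketch of parsing the ‘egco’ payload laid out above (after the FullBox header) is given below, assuming the big-endian byte order usual in ISOBMFF; the helper is hypothetical and not part of any standard library for ISOBMFF.

# Parse the 'egco' EntityToGroupBox payload: group_id,
# num_entities_in_group, entity_id[], total_width, total_height, source_id.
import struct

def parse_egco(payload: bytes):
    group_id, num = struct.unpack_from(">II", payload, 0)
    offset = 8
    entity_ids = list(struct.unpack_from(">%dI" % num, payload, offset))
    offset += 4 * num
    total_width, total_height = struct.unpack_from(">HH", payload, offset)
    source_id, = struct.unpack_from(">I", payload, offset + 4)
    return {"group_id": group_id, "entity_ids": entity_ids,
            "total_width": total_width, "total_height": total_height,
            "source_id": source_id}

payload = struct.pack(">II2IHHI", 1, 2, 101, 102, 1920, 1080, 1)
assert parse_egco(payload)["entity_ids"] == [101, 102]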

The relative two-dimensional coordinates of each track within the composition defined by an entity grouping of type ‘egco’ can be provided by defining a track group of type ‘2dcc’ as defined below:

aligned(8) class SubPictureRegionBox extends FullBox(‘sprg’, 0, 0) {
  unsigned int(16) object_x;
  unsigned int(16) object_y;
  unsigned int(16) object_width;
  unsigned int(16) object_height;
}

aligned(8) class SpatialRelationship2DDescriptionBox extends TrackGroupTypeBox(‘2dcc’) {
  // track_group_id is inherited from TrackGroupTypeBox;
  SubPictureRegionBox();
}

where object_x, object_y, object_width, and object_height provide the relative two-dimensional coordinates of each track in the composition.

A given EntityToGroupBox of type ‘egco’ is associated with the corresponding SpatialRelationship2DDescriptionBox by defining a group_id equal to the track_group_id.

Alternatively, the relative two-dimensional coordinates of each track within the composition defined by an entity grouping of type ‘egco’ can be provided by defining a sample grouping or default sample grouping of type ‘trif’ in each tile track as defined in ISO/IEC 14496-15 4th edition. As an alternative, the relative two-dimensional coordinates can be defined as a new generic full box 2DCoordinateForEntityGroupBox (‘2dco’) that would be located in the VisualSampleEntry in each tile track pertaining to a group:

aligned(8) class 2DCoordinateForEntityGroupBox extends FullBox(‘2dco’, version, flags) {
  unsigned int(32) entity_group_id;
  unsigned int(16) object_x;
  unsigned int(16) object_y;
  unsigned int(16) object_width;
  unsigned int(16) object_height;
}

where

- entity_group_id provides the identifier of the associated EntityToGroupBox(‘egco’) defining the group,
- object_x and object_y provide the horizontal and vertical position of the top-left corner of the samples of this track within the composition, and
- object_width and object_height provide the width and height of the samples of this track within the composition.

As an alternative, this new generic box 2DCoordinateForEntityGroupBox (‘2dco’) can be defined as a new sample grouping as follows:

class 2DCoordinateForEntityGroupBox extends VisualSampleGroupEntry(‘2dco’) {
  unsigned int(32) entity_group_id;
  unsigned int(16) object_x;
  unsigned int(16) object_y;
  unsigned int(16) object_width;
  unsigned int(16) object_height;
}

Turning back to FIG. 2b, region-wise packing information for the track is added to the metadata describing the encapsulation of video bit-streams or video sub-bit-streams, at step 260. This step is optional when the sub-picture track is not further rearranged into regions.

Region-wise packing provides information for remapping of a luma sample location in a packed region onto the luma sample location of the corresponding projected region. In MPEG OMAF, region-wise packing may be described according to the following data structure:

aligned(8) class RegionWisePackingStruct() {
  unsigned int(1) constituent_picture_matching_flag;
  bit(7) reserved = 0;
  unsigned int(8) num_regions;
  unsigned int(32) proj_picture_width;
  unsigned int(32) proj_picture_height;
  unsigned int(16) packed_picture_width;
  unsigned int(16) packed_picture_height;
  for (i = 0; i < num_regions; i++) {
    bit(3) reserved = 0;
    unsigned int(1) guard_band_flag[i];
    unsigned int(4) packing_type[i];
    if (packing_type[i] == 0) {
      RectRegionPacking(i);
      if (guard_band_flag[i])
        GuardBand(i);
    }
  }
}

where

- proj_picture_width and proj_picture_height specify the width and height, respectively, of the projected picture, in relative projected picture sample units,
- packed_picture_width and packed_picture_height specify the width and height, respectively, of the packed picture, in relative packed picture sample units,
- num_regions specifies the number of packed regions when constituent_picture_matching_flag is equal to 0. When constituent_picture_matching_flag is equal to 1, the total number of packed regions is equal to 2*num_regions and the information in RectRegionPacking(i) and GuardBand(i) applies to each stereo constituent picture of the projected picture and the packed picture,
- RectRegionPacking(i) specifies the region-wise packing between the i-th packed region and the i-th projected region (i.e. it converts x, y, width, height coordinates from the packed region to the projected region with optional transforms (rotation, mirroring)), and
- GuardBand(i) specifies the guard bands, if any, for the i-th packed region.

According to embodiments of the invention, when region-wise packing information is defined in a sub-picture track, this structure only describes the packing of the sub-picture track by reference to the complete projected picture. Thus packed_picture_width and packed_picture_height are equal to the sub-picture track's width and height.
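A minimal Python sketch of the packed-to-projected remapping for a rectangular region without rotation or mirroring (packing_type 0, no transform) is shown below. The field names mirror RectRegionPacking(i); the exact sample-position rounding of OMAF is simplified to integer scaling for the example.

# Remap a luma sample location from a packed region to the projected region.
def packed_to_projected(x, y, r):
    rel_x = x - r["packed_reg_left"]       # position within the packed region
    rel_y = y - r["packed_reg_top"]
    # scale into the projected region (implicit resampling)
    proj_x = r["proj_reg_left"] + rel_x * r["proj_reg_width"] // r["packed_reg_width"]
    proj_y = r["proj_reg_top"] + rel_y * r["proj_reg_height"] // r["packed_reg_height"]
    return proj_x, proj_y

region = {"packed_reg_left": 0, "packed_reg_top": 0,
          "packed_reg_width": 960, "packed_reg_height": 540,
          "proj_reg_left": 0, "proj_reg_top": 0,
          "proj_reg_width": 1920, "proj_reg_height": 1080}
assert packed_to_projected(480, 270, region) == (960, 540)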

Optionally, at step 270, content coverage information for the track and for compositions of tracks is added to the metadata describing the encapsulation of video bit-streams or video sub-bit-streams. This step uses the CoverageInformationBox as defined in ISO/IEC 23090-2.

For omnidirectional video, the CoverageInformationBox provides information on the area on the sphere covered by the content. The nature of the content depends on the container of this box. When present in a SpatialRelationship2DDescriptionBox ‘2dcc’, the content refers to the entire content represented by all tracks belonging to the same sub-picture composition track group, and a composition picture composed from these tracks is referred to as a packed picture of the entire content. When present in a sample entry of a track, the content refers to the content represented by this track itself, and the picture of a sample in this track is referred to as a packed picture of the content. When no CoverageInformationBox is present for a track, it indicates that the content covers the entire sphere.

It is to be noted that for omnidirectional video, the Projected omnidirectional video box (‘povd’) is an intermediate box defined by MPEG OMAF and located in a VisualSampleEntry in a track.

In addition, for omnidirectional video, the SpatialRelationship2DDescriptionBox track group box (‘2dcc’) may be extended as follows:

aligned(8) class SpatialRelationship2DDescriptionBox extends TrackGroupTypeBox(‘2dcc’) {
  // track_group_id is inherited from TrackGroupTypeBox;
  SpatialRelationship2DSourceBox(); // mandatory, must be first
  SubPictureRegionBox();            // optional
  CoverageInformationBox();         // optional
}

As a second embodiment, track coverage information and composition coverage information can be signaled using a single common CoverageInformationBox with a flags value to distinguish the local and global indications. Since CoverageInformationBox is an ISOBMFF FullBox, the distinction between track and global coverage can be expressed through the flags parameter of the box.

According to this second embodiment, the CoverageInformationBox is defined as follows:

Box Type: ‘covi’
Container: Projected omnidirectional video box (‘povd’)
Mandatory: No
Quantity: Zero or more

aligned(8) class CoverageInformationBox extends FullBox(‘covi’, 0, 0) {
  ContentCoverageStruct()
}

The structure of the box is almost the same as in the previous embodiment, except that multiple instances of the box can be defined in case local and composition coverage information must be defined in a same track.

The CoverageInformationBox is then defined as providing information on the area on the sphere covered by the content. The nature of the content is given by the flags parameter. The default value for the CoverageInformation flags is 0, meaning that this box describes the coverage of the entire content. If this track belongs to a two-dimensional spatial relationship track group, the entire content refers to the content represented by all tracks belonging to the same two-dimensional spatial relationship track group, and a composition picture composed from these tracks is referred to as a packed or projected picture of the entire content. Otherwise, the entire content refers to the content represented by this track itself, and the picture of a sample in this track is referred to as a packed or projected picture of the entire content.

When the value of the CoverageInformation flags is 1, this box describes the spherical area covered by the packed or projected pictures of the content represented by this track.

The absence of this box indicates that the content covers the entire sphere.

In addition, a new flag value is defined as follows:

Coverage_local: indicates that the coverage information is local to the track containing the box. The flag value is 0x000001. By default, this value is not set.
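A minimal sketch, in Python, of how a parser could interpret this flags value of the second embodiment is given below; the function name is illustrative only.

# Distinguish track-local coverage from coverage of the entire content.
COVERAGE_LOCAL = 0x000001

def coverage_scope(flags: int) -> str:
    return "track" if flags & COVERAGE_LOCAL else "entire content"

assert coverage_scope(0) == "entire content"   # default: global coverage
assert coverage_scope(1) == "track"            # Coverage_local set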

Going back to FIG. 2b, at step 280, it is checked whether the virtual reality media content is actually stereoscopic virtual reality media content, i.e. comprises left and right views.

If the content is only monoscopic, the process directly goes to step 290.

If the content is stereoscopic, stereoscopic signaling is added to the encapsulation at step 285.

For stereoscopic content, classically, both left and right view sequences are acquired from a stereoscopic camera and are composited into a video sequence or two video sequences according to a composition type.

The process of combining two frames representing two different views of a stereoscopic content into one single frame is called frame packing (see step 125 in FIG. 1).

Frame packing consists in packing two views that form a stereo pair into a single frame. There exist several well-known and widely used frame packing schemes: side by side, top-bottom, frame sequential, vertical line interleaved type, etc. For example, the MPEG application format ISO/IEC 23000-11 1st edition (“Stereoscopic video application Format”) or ISO/IEC 23001-8 2nd edition (“Coding-independent code-points (CICP)”) defines some of these schemes. Frame packing can also consist in keeping each view in separate frames, as for example with the VideoFramePackingType having the value 6 defined in ISO/IEC 23001-8 2nd edition (“CICP”).

For instance, still according to this specification, the value 3 signals that each decoded frame contains a side-by-side packing arrangement of corresponding frames of two constituent views, and the value 4 signals that each decoded frame contains a top-bottom packing arrangement of corresponding frames of two constituent views.

In order to signal that a track contains stereoscopic media data, a StereoVideoBox is defined in the VisualSampleEntry in the track.

Turning back to step 250 of FIG. 2, the SpatialRelationship2DDescriptionBox is defined to match the definition of the Spatial Relationship Descriptor ‘SRD’ as defined in the Dynamic Adaptive Streaming over HTTP (DASH) protocol (ISO/IEC 23009-1 3rd edition) to express spatial relationships between video tracks, as provided in the Table below:

ISOBMFF parameter                                 DASH SRD parameter
trgr::‘2dcc’::track_group_id                      spatial_set_id
trgr::‘2dcc’::‘sprg’::object_x                    object_x
trgr::‘2dcc’::‘sprg’::object_y                    object_y
trgr::‘2dcc’::‘sprg’::object_width                object_width
trgr::‘2dcc’::‘sprg’::object_height               object_height
trgr::‘2dcc’::‘2dsr’::total_width                 total_width
trgr::‘2dcc’::‘2dsr’::total_height                total_height
trgr::‘2dcc’::‘2dsr’::source_id (when present)    source_id (when present)

A TrackGroupTypeBox with ‘2dcc’ track_grouping_type indicates that the track belongs to a group of tracks corresponding to spatial parts of a video. The tracks that have the same value of source_id within a TrackGroupTypeBox of track_group_type ‘2dcc’ are mapped as originating from the same source (i.e. with the same origin (0, 0) and the same orientation of their axes). More precisely, the complete composition pictures (with size total_width and total_height) from two track groups with the same source_id are perceptually or visually equivalent (e.g. two composition pictures representing the same visual content at two different resolutions or two different qualities). Adding a source_id parameter allows expressing whether two sets of sub-picture tracks share a common referential (same source_id value) or not (different source_id values). The indication that two sets of sub-picture tracks share a same referential may be interpreted as a possibility to combine the sub-picture tracks from the different sets for rendering (but this is left to the application: the ISOBMFF parser can, from the indication in the encapsulated file, inform the application about the possible alternatives). The absence of the source_id parameter in the description of the track group for 2D spatial relationships indicates that the relative positions between the two sets of sub-picture tracks are unknown or unspecified.

All sub-picture tracks belonging to a TrackGroupTypeBox with ‘2dcc’ track_grouping_type and the same track_group_id shall have the same source_id, when present.

Tracks belonging to a TrackGroupTypeBox with ‘2dcc’ track_grouping_type and different track_group_id are compatible and can be combined together if they have the same source_id. When source_id is present, tracks belonging to a TrackGroupTypeBox with ‘2dcc’ track_grouping_type and different track_group_id are not compatible and cannot be combined together if they have a different value for their source_id. When the source_id parameter is not present in the description of a TrackGroupTypeBox with ‘2dcc’ track_grouping_type, this does not imply that sub-picture tracks from different track groups with ‘2dcc’ track_grouping_type cannot be combined. There may be alternative means to indicate such a possibility for combination. For instance, in the case of omnidirectional video, two sub-picture tracks do not represent sub-parts of the same source when the two-dimensional projected pictures representing this source are not visually equivalent (e.g. they have different projection formats or different viewport orientations). In such a case, they may be signalled with a different value of source_id in their respective descriptions of the track group for 2D spatial relationships.

As an alternative, this latter rule applies even if there exists an alternate group grouping sub-picture tracks from ‘2dcc’ track groups with different source_id. That means those sub-picture tracks are alternatives (for instance they have different coding formats, e.g. AVC and HEVC) but they are not intended to be combined with sub-picture tracks with a different coding format.
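The combination rule stated above can be summarized by the following minimal Python sketch, assuming a hypothetical dictionary representation of the ‘2dcc’ group properties; an absent source_id leaves the relative positions unspecified rather than forbidding combination.

# Decide whether sub-picture tracks from two '2dcc' track groups may combine.
def can_combine(group_a: dict, group_b: dict):
    sid_a, sid_b = group_a.get("source_id"), group_b.get("source_id")
    if sid_a is None or sid_b is None:
        return None                 # unknown: other signaling may apply
    return sid_a == sid_b

assert can_combine({"source_id": 1}, {"source_id": 1}) is True
assert can_combine({"source_id": 1}, {"source_id": 2}) is False
assert can_combine({}, {"source_id": 2}) is None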

When media content is split into sub-parts to encode and to encapsulate individually, the resulting sub-picture tracks may benefit from additional descriptive information as explained by reference to steps 2250 or 250 in FIG. 2a or 2b.

Indeed, from a content generation point of view, splitting the content into spatial sub-parts provides adaptation to the client's display or processing capabilities. As such, the media may be provided as alternative sets of sub-picture tracks covering more or less of the captured image 1011 or 122. For example, the server may encapsulate the sub-picture tracks with information indicating whether or not the set of sub-picture tracks belonging to one track group covers the whole source image.

Moreover, when the whole source image is covered, it is advantageous to know whether the set of sub-picture tracks exactly covers the whole source image, or whether there are some overlaps. Conversely, it is advantageous to know if the whole source image is not covered. In this case, it is advantageous to know which part is exactly covered and whether there are holes and where they are located.

Said information allows a client exploring the media file or a media description file to retrieve the missing parts.

Having such information at the client side helps the player to select the best set of sub-picture tracks according to its capacities, the application needs or user choices.

A first aspect of the invention then proposes to improve the track groups for 2D spatial relationship description with an indication about the set of sub-picture tracks with respect to the source image.

FIG. 11 illustrates a relation between a set of sub-picture tracks and a source image, according to an embodiment of the first aspect of the invention. First, a captured image 1200 (e.g. 1011 or 122 by reference to FIG. 1a or 1b) is split into tiles or rectangular regions or spatial sub-parts (8 regions in FIG. 11). On this large image, a region of interest is identified with potential interest for rendering, access or transmission 1201. The encapsulation then generates a description of the captured image 1200 as different track groups 1202 and 1203. 1202 corresponds to a set of sub-picture tracks that, when composed together, lead to the full picture 1200, as indicated by the information 1204 associated to the track group 1202. Likewise, the other track group 1203 has similar information 1204, but this time indicating that the reconstructed image from the composition of the sub-picture tracks in this track group would lead to a partial view of the source image 1200.

In this example, it is actually an encapsulation choice because an access to the region of interest 1201 is provided as a combination of tracks. When deciding to render only the region of interest, the client then determines the list of sub-picture tracks to process. There is no need to process all the sub-picture tracks. Optionally, when the track group does not lead to full reconstruction, the track group description may provide additional information 1205 explaining why the reconstruction is partial. When encapsulating with ISOBMFF, information 1204 and 1205 may be provided as illustrated in FIG. 12.

FIG. 12 illustrates an example of track groups for 2D spatial relationships with additional information related to reconstruction. To preserve backward compatibility, a new version of the ‘2dcc’ box (1300) is proposed in which the part providing the group properties, the ‘2dsr’ box 1301, indicates information 1303 on the set of sub-picture tracks: does it correspond to a “complete set” or not. “Complete set” set to ‘1’ means that the reconstruction from the sub-picture tracks in this track group will correspond to the full source image. “Complete set” set to ‘0’ means that the reconstruction from the sub-picture tracks in this track group will not correspond to the full source image. In the latter case, additional information 1304 may be provided. For example, a set of flags can indicate whether gaps exist, or whether there are some overlaps. When one or the other is present, a list of gaps or overlaps may be provided as a list of rectangular regions, using the ‘sprg’ structure. In the case of omnidirectional content, the indication that the set of sub-picture tracks is not a complete set may be interpreted by a parser as an instruction to further inspect the media file, for example by looking for a region-wise packing description and by parsing this description when present. For example, in case an overlap indication is present in 1304, the parser may determine whether the overlap is due to the presence of guard bands in the sub-picture tracks. In OMAF, this can be determined by inspecting the region-wise packing box ‘rwpk’ and checking the guard_band_flag parameter. If backward compatibility is not an issue, then the additional indication can be directly inserted as additional parameters in one part of the track group for 2D spatial relationships. For example, the indication on complete_set may be provided using 0 for both version and flags values, as follows:

aligned(8) class SpatialRelationship2DSourceBox extends FullBox(‘2dsr’, 0, 0) {
  unsigned int(32) total_width;
  unsigned int(32) total_height;
  unsigned int(32) source_id;
  unsigned int(2) reference_picture;
  unsigned int(1) complete_set;
  unsigned int(29) reserved;
}

where the semantics for total_width, total_height and source_id remain unchanged and:

reference_picture (here represented by 2 bits) specifies the source image that has been split into the sub-picture tracks of this track group. When taking the value “0”, it indicates that the positions of the sub-picture tracks in this track group are expressed in the coordinate system of the captured picture (this is the default value). When taking the value “1”, it indicates that the positions of the sub-picture tracks in this track group are expressed in the coordinate system of the projected picture. When taking the value “2”, it indicates that the positions of the sub-picture tracks in this track group are expressed in the coordinate system of the frame-packed picture. When taking the value “3”, it indicates that the positions of the sub-picture tracks in this track group are expressed in the coordinate system of the packed picture.

In the above example, the additional information related to reconstruction (the complete_set parameter) is mixed with source_id and reference_picture. It may be provided as well when no information on source_id is present or when no indication on the reference picture is provided:

aligned(8) class SpatialRelationship2DSourceBox extends FullBox(‘2dsr’, 0, 0) {
  unsigned int(32) total_width;
  unsigned int(32) total_height;
  unsigned int(1) complete_set;
  unsigned int(30) reserved;
}

In an alternative embodiment, more bits could be allocated to the additional information related to reconstruction. For example, using 2 bits instead of one allows indicating to media players or ISOBMFF parsers whether the reconstruction from the set of sub-pictures in the track group leads to a complete reconstruction (for example when the 2 bits take the value “00”, 0 in decimal), or whether it leads to a subset of the full picture, i.e. the reconstruction contains one or more gaps (for example when the 2 bits take the value “01”, 1 in decimal), or whether it leads to a superset of the full picture, i.e. the reconstruction contains parts which are overlapping (for example when the 2 bits take the value “10”, 2 in decimal). When the 2 bits take the value “11”, 3 in decimal, the reconstruction contains both gaps and overlaps. When more than a simple indication is used to describe information related to reconstruction, the parameters describing the reconstruction may be organized into a dedicated descriptor in the track group description:

aligned(8) class SpatialRelationship2DDescriptionBox extends TrackGroupTypeBox(‘2dcc’) {
  // track_group_id is inherited from TrackGroupTypeBox;
  SpatialRelationship2DSourceBox(); // mandatory, must be first
  SubPictureRegionBox();            // optional
  ReconstructionInfoBox();          // optional
}

where ReconstructionInfoBox() may provide the following information on reconstruction: does the set of sub-picture tracks correspond to the full source, to a subset (gaps) or to a superset (overlaps). Depending on this value, a description of where the gaps are may be provided, and likewise in case of overlap. Note that there may be both gaps and overlaps.

Optionally, a parameter indicates the expected number of sub-picture tracks in the track group. This information, when present in the file, provides the number of sub-picture tracks expected for the reconstruction. For example, when it is set to 10, as long as a client streaming or downloading the media file does not have 10 sub-picture tracks in the track group, it may not start the reconstruction of the samples. To handle a number of expected sub-picture tracks that varies over time, this information may also be provided in the sample group for 2D spatial relationships ‘2dcc’, so that it can be updated from one media fragment to another. The indication of the number of sub-picture tracks expected for the reconstruction may also be provided within the properties of the group, for example, in the case of track groups for 2D spatial relationships, in the ‘2dsr’ box.
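As a hedged illustration of the 2-bit alternative described above, a parser could interpret the indication as follows in Python; the value assignments are the ones suggested in the text and the field name is illustrative.

# Interpret the 2-bit reconstruction indication.
RECONSTRUCTION = {
    0b00: "complete reconstruction",
    0b01: "subset of the full picture (gaps)",
    0b10: "superset of the full picture (overlaps)",
    0b11: "gaps and overlaps",
}

def describe_reconstruction(bits: int) -> str:
    return RECONSTRUCTION[bits & 0b11]

assert describe_reconstruction(0b01) == "subset of the full picture (gaps)"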

The indication related to reconstruction from sub-picture tracks can be combined with the source indication (the source_id parameter of ‘2dsr’), with reference picture signaling or with the equivalent groups signaling described below in a second aspect of the invention. It applies to 2D or to 360° media.

When applied to 360° media, the additional information related to reconstruction is relative to the reference picture indication, when present in the description of the track group for 2D spatial relationships. It may be binary information like the complete_set parameter. It may be the 2-bit parameter. It may be a parameter indicating the percentage of the projected picture 122 covered by the reconstructed picture resulting from the combination of the sub-picture tracks. When the reference picture is not indicated, the additional information related to reconstruction may indicate with a binary value “00” that the projected picture 122 is fully covered or partially covered (binary value “01”), and with a binary value “10” that the packed picture is fully covered or partially covered (value “11”). Depending on the value of the first bit, a parser will determine whether region-wise packing is applied to the projected picture, and it may decide to further analyze the media file when the last bit indicates partial reconstruction. This additional analysis can be used to determine which parts are present or missing in the reconstructed picture. When the last bit indicates full reconstruction, there is no need to further parse or analyze the file to determine that the reconstruction is complete.

Regarding the percentage of the reference picture, or of the projected picture in the 360° video case, or of the source picture in the 2D video case, optionally, in the part corresponding to track properties within the track group 1302, an additional parameter (not represented on FIG. 12) may provide the contribution of the track to this percentage. For example, for a given group of sub-pictures, when a sub-picture track has a significant contribution to the reconstruction, it may be a good indication to start downloading or streaming it first and reconstructing it first when a player implements progressive reconstruction.

FIG. 4a describes an example of sub-picture track encapsulation containing several track groups for 2D spatial relationships description. This example applies to both 2D and omnidirectional video.

In this example, Tracks #1 to #4 belong to a track group 41 of type ‘2dcc’ with track_group_id equal to 10 and source_id equal to 1. Tracks #5 to #8 belong to a different track group 42 of type ‘2dcc’ with track_group_id equal to 20 but with the same source_id 400 equal to 1. There is also a third track group 43 of type ‘2dcc’ with a track_group_id equal to 30 and a different source_id 401 equal to 2. In addition, there are several alternate groups 44 to 47. All tracks that belong to the same alternate group (i.e. that have the same alternate_group identifier in their track header box ‘tkhd’) specify a group or collection of tracks containing alternate data. Alternate data may correspond to alternate bitrate, codec, language, packet size, etc. These differentiating attributes may be indicated in a track selection box. Only one track within an alternate group should be played or streamed at any one time. In this example, Tracks #1, #5 and #9 belong to the same alternate group 44 with identifier equal to 100. For instance, track #1 and track #5 are alternate tracks with different qualities, and track #9 is an alternate track to track #1 and track #5 in terms of codec. Tracks #2, #6 and #10 belong to the same alternate group 45 with identifier equal to 200. For instance, track #2 and track #6 are alternate tracks with different resolutions, and track #10 is an alternate track to track #2 and track #6 in terms of frame rate, and so on.

The track groups 41 and 42 have the same source_id 400 and the track group 43 has a different source_id 401, meaning that sub-picture tracks belonging to track groups 41 and 42 can be combined together (with respect to other constraints, i.e. at most one sub-picture track per alternate group). On the contrary, sub-picture tracks from track group 43 are not intended to be combined with any sub-picture tracks from track groups 41 and 42, even though they may belong to a same alternate group, because they do not have the same source_id. The source_id parameter then provides an indication to the players on the sub-picture tracks that can be part of a same spatial composition. For a given spatial position, one sub-picture track can be considered visually equivalent to another sub-picture track at the same given spatial position. This is useful for (sub-picture) track selection when the media content is provided in multiple tracks. Moreover, it allows dynamic adaptation (in quality/bitrate or resolution) to display a same spatial composition, depending on the selected sub-picture tracks.
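A hedged Python sketch of the selection logic implied by FIG. 4a follows: pick at most one sub-picture track per alternate group, restricted to tracks whose group carries the wanted source_id. The track records are hypothetical dictionaries, not a defined file structure.

# Select at most one track per alternate group, for a given source_id.
def select_tracks(tracks, source_id):
    chosen, used_alternate_groups = [], set()
    for t in tracks:
        if t["source_id"] != source_id:
            continue                        # e.g. track group 43 is excluded
        if t["alternate_group"] in used_alternate_groups:
            continue                        # one track per alternate group
        used_alternate_groups.add(t["alternate_group"])
        chosen.append(t["track_id"])
    return chosen

tracks = [
    {"track_id": 1, "source_id": 1, "alternate_group": 100},
    {"track_id": 5, "source_id": 1, "alternate_group": 100},  # alternative to #1
    {"track_id": 9, "source_id": 2, "alternate_group": 100},  # other source
    {"track_id": 2, "source_id": 1, "alternate_group": 200},
]
assert select_tracks(tracks, source_id=1) == [1, 2]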

FIG. 4b illustrates, according to a second aspect of the invention, an alternative way to indicate that groups are equivalent groups.

According to an embodiment, it may comprise an indication which is directly in the description of the track group and no longer relies on alternate groups or flags in the track header box. This alternative is useful when the source_id is not present or when there is no track selection box in the media file, so that the players can determine alternative tracks when composing the image to display. In this embodiment, the descriptive data about track grouping, called here ‘TrackGroupTypeBox’, and in particular the descriptive data for 2D spatial relationship description, for instance ‘SpatialRelationship2DSourceBox’ 410, is amended compared to known solutions, as illustrated by reference 411 in FIG. 4b. An additional parameter 413, called here equivalent_group_ID, provides the list of equivalent track groups for this track group. It is described as a list of track_group_id (for example the track_group_id declared in the TrackGroupTypeBox). FIG. 4b allows backward compatibility with the initial version of the TrackGroupTypeBox for 2D spatial relationship description. Preferably, the additional parameter 413 for equivalent group signaling is present only

- when the amended version of the box (not illustrated) is used, or
- in the known TrackGroupTypeBox, conditionally on a value of the flags parameter (illustrated 414) of the TrackGroupTypeBox of type ‘2dcc’.

For example, the following value is defined for the 24-bit integer flags 414:

- “track_group_equivalence”: indicates that this track group has equivalent track groups, meaning that tracks with the same properties in this track group and the equivalent ones are interchangeable or switchable. The flag value is for example 0x000002 (a reserved 24-bit value, not conflicting with other reserved values for the flags parameter of the track group type box).

As mentioned above, instead of using a reserved value for the flags parameter, the indication of equivalent groups may be conditioned on a new version of the structure providing the description of the track group, i.e. the TrackGroupTypeBox, as follows:

aligned(8) class SpatialRelationship2DDescriptionBox extends TrackGroupTypeBox(‘2dcc’, version, 0) {
  // track_group_id is inherited from TrackGroupTypeBox;
  SpatialRelationship2DSourceBox(); // mandatory, must be first
  SubPictureRegionBox();            // optional
  if (version == 1) {
    GroupEquivalenceBox();
  }
}

With GroupEquivalenceBox being defined as a FullBox:

aligned(8) class GroupEquivalenceBox extends TrackGroupTypeBox(‘grev’) {
  // track_group_id is inherited from TrackGroupTypeBox;
  unsigned int(32) track_group_IDs[];
}

where the track_group_IDs parameter provides a list of track_group_id values identifying track groups that contain tracks “equivalent” to the tracks of this track group. In the example above, the list of equivalent track groups is provided as a new box in the new version of the track group type box. Alternatively, it may be provided as a new parameter of the ‘2dsr’ box, or more generally in the box providing the group properties, as follows:

aligned(8) class SpatialRelationship2DSourceBox extends FullBox(‘2dsr’, version, 0) {
  unsigned int(32) total_width;
  unsigned int(32) total_height;
  unsigned int(32) source_id;
  if (version == 1) {
    unsigned int(32) equivalent_group_ID[];
  }
}

When the flags parameter is used instead of the version parameter, the description of the group properties for 2D spatial relationships, the ‘2dsr’ box 411, would become:

aligned(8) class SpatialRelationship2DSourceBox extends FullBox(‘2dsr’, version, flags) {
  unsigned int(32) total_width;
  unsigned int(32) total_height;
  unsigned int(32) source_id;
  if ((flags & 0x02) == 0x02) {
    unsigned int(32) equivalent_group_ID[];
  }
}

The declaration of equivalent track groups is not limited to the ‘2dcc’ track group type. Indeed, as soon as a track group contains tracks that may be interchangeable with other tracks in other track groups with the same track_group_type, a list of equivalent track groups may be provided in the track group declaration. The matching of each track inside the equivalent track groups is computed by comparing the track properties. For example, in the case of track groups for 2D spatial relationships, any track having the same object_x, object_y, object_width and object_height as another track in one of the equivalent track groups can be considered as an interchangeable track. This can occur, for example, when encoding with HEVC and independent tiles, for sub-picture tracks corresponding to a same tile (same position) in different encoding configurations such as quality or bitrate. It can also correspond to sub-picture tracks from independent bitstreams (e.g. AVC, HEVC, etc.) that could be composed together to reconstruct a given source.
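The matching rule just described can be sketched in Python as follows; the dictionary representation of the ‘sprg’ track properties is an assumption for the example.

# Pair tracks across equivalent track groups by comparing 'sprg' properties.
def find_equivalent(track, candidates):
    keys = ("object_x", "object_y", "object_width", "object_height")
    return [c["track_id"] for c in candidates
            if all(c[k] == track[k] for k in keys)]

t1 = {"track_id": 1, "object_x": 0, "object_y": 0,
      "object_width": 960, "object_height": 540}
t5 = {"track_id": 5, "object_x": 0, "object_y": 0,
      "object_width": 960, "object_height": 540}
t6 = {"track_id": 6, "object_x": 960, "object_y": 0,
      "object_width": 960, "object_height": 540}
assert find_equivalent(t1, [t5, t6]) == [5]   # same position and size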

As an alternative embodiment for the indication of equivalent groups, the equivalence may be signalled within the track properties with respect to its track group. Indeed, a description of a track group, i.e. a TrackGroupTypeBox, may contain a structure (an ISOBMFF box or FullBox) declaring group properties (for example the ‘2dsr’ for the ‘2dcc’ track group type 411) and one or more boxes declaring track properties within the track group (for example the ‘sprg’ for the ‘2dcc’ track group type 412). The embodiment illustrated in FIG. 4b suggests the declaration of equivalent track groups in the box for group properties 411, thus requiring a parser to compute within each equivalent track group the matching between tracks. An alternative embodiment avoiding this computation consists in declaring, as part of the track properties within the track group (for example 412), the equivalence for each track of the track group. For example, when used in the track group for the 2D spatial relationship, the ‘sprg’ box 412 then becomes:

aligned(8) class SubPictureRegionBox extends FullBox(‘sprg’, version, 0) {
  unsigned int(16) object_x;
  unsigned int(16) object_y;
  unsigned int(16) object_width;
  unsigned int(16) object_height;
  if (version == 1) {
    unsigned int(32) equivalent_track_IDs[];
  }
}

where the equivalent_track_IDs parameter provides the list of track_ID values (for the track identifier declared in the track header box) for the tracks that can be considered as equivalent to the current track pertaining to this track group. When the flags parameter is used instead of the version parameter, the ‘sprg’ box would become:

aligned(8) class SubPictureRegionBox extends FullBox(‘sprg’, version, flags) {
  unsigned int(16) object_x;
  unsigned int(16) object_y;
  unsigned int(16) object_width;
  unsigned int(16) object_height;
  if ((flags & 0x02) == 0x02) {
    unsigned int(32) equivalent_track_IDs[];
  }
}

Having the list of equivalent track groups inside each track group declaration may be costly in terms of bytes. Indeed, the track group declaration occurs in each track of the track group. When there are many equivalent groups, the list of track group IDs is then repeated in each track of each equivalent track group.

An embodiment providing a more compact description consists in defining in a single place the equivalence between track groups.

FIG. 5 illustrates another embodiment where the indication of equivalent track groups is provided outside the track declarations for compactness of the description. Indeed, when the indication of equivalent track groups is provided at track level, for example in the description of the track groups, it is duplicated in each track of the track group. Having this declaration at the top level of the media file, for example under the ‘moov’ box, allows a single declaration and rapid access to this information by parsers. In FIG. 5, the encapsulated media file 420 contains three track groups: 421, 422, 423, respectively with a track grouping type and track_group_id (#11, #12 and #13). Each track group contains one or more tracks identified by their track_IDs. The track groups 421 and 422 are equivalent, as represented by 425. A dedicated box 424 is used to declare this equivalence.

In case the list of track groups is declared using the entity grouping mechanism of ISOBMFF (i.e. in a GroupsListBox), the indication of equivalent track groups is declared with the entity grouping mechanism, for example inside a GroupsListBox as an additional descriptor. For example, the descriptor 424 is declared in the ISOBMFF as a structure (or box) providing the lists of equivalence groups, for example a GroupEquivalenceBox (the name here is just an example). In the example of FIG. 5, 424 provided as a GroupEquivalenceBox would declare the list #11, #12 to indicate to parsers or players that the tracks in these track groups are equivalent: track #1 and track #5 may be equivalent, as well as #2 and #6, #3 and #7 and #4 with #8. Optionally, the GroupEquivalenceBox may contain an additional field or parameter providing the type of equivalence. The possible values for this parameter form a pre-defined or registered list of values, like for example “bitstream_equivalence”, meaning that tracks are interchangeable when the parser is doing sample reconstruction (either implicit or explicit).

Another example of value for the additional field or parameter providing the type of equivalence is another pre-defined or registered value, like for example “display_equivalence”, meaning that the pictures or sub-pictures resulting from the decoding of these tracks are visually equivalent. For example, in the case of sub-picture tracks, one track from track group #11 may be used with other tracks in track group #12 (or the reverse) to compose and to reconstruct the initial image that was split. Alternatively, instead of describing the indication of equivalent track groups 424 as a GroupEquivalenceBox, the indication of equivalent track groups 424 may be provided as one EntityToGroupBox. For example, the structure 424 is an EntityToGroupBox with a dedicated grouping_type equal to ‘tgeq’ for track group equivalence, indicating two entities in the group: the track groups #11 and #12 (as entity_id values). A dedicated grouping type is preferred instead of using the existing ‘eqiv’ from ISO/IEC 23008-12. This is because the existing ‘eqiv’ grouping type in EntityToGroupBox, when applied to samples of a track, indicates that samples are equivalent to each other inside a same track and potentially with samples in another track or items listed in the EntityToGroupBox. This latter approach also applies when the track groups are declared in a TrackGroupBox ‘trgr’ of each track. The descriptor or structure for track group equivalence 424 may be stored under a ‘meta’ box of the media file. It can be, for example, under the moov/meta box or in a meta box at the top level of the file.

The descriptor or structure for track group equivalence 424 may also provide the equivalence at track level: in this case, the grouping type value is another reserved code for track equivalence signaling: ‘trey’. The entity_id values provided in the structure 424 are then track_IDs. This requires as many EntityToGroupBox(es) with the grouping_type for track equivalence signaling (e.g. ‘trey’) as there are track-to-track associations to declare. In the example of FIG. 5, there may be one EntityToGroupBox with grouping_type ‘trey’ to declare track #1 and track #5 as equivalent tracks, one for track #2 and track #6, one for track #3 and track #7 and a last one for track #4 with track #8.

As another alternative to the embodiment illustrated on FIG. 4b, the indication of equivalent track groups 424 uses the existing track reference mechanism. However, there is no track reference type available and dedicated to an “equivalence” indication. It is then proposed to define new track reference types ‘beqt’ and ‘deqt’ which, when used between two tracks, respectively indicate that the tracks are equivalent or switchable in terms of bitstream (interchangeable sub-bitstreams during the bitstream concatenation or sample reconstruction process) or equivalent in terms of display (i.e. displaying the same content, but potentially at different quality or resolution). While the former allows combination/track replacement in the compressed domain, the latter allows combination/track replacement only after decoding, i.e. in the pixel domain.

The track reference mechanism as defined in ISOBMFF can be extended to also describe associations between track groups. The current track reference box in the ISOBMFF hierarchy of boxes can only be declared under a ‘trak’ box. In an embodiment of the invention, it is proposed to allow a track reference in the track group box as well, so that a group of tracks (in the group) can be directly associated to another group of tracks:

Box Type: ‘tref’
Container: TrackBox or TrackGroupBox
Mandatory: No
Quantity: Zero or one

with the following semantics: when used in a TrackGroupBox, the track_IDs is an array of integers providing the track group identifiers (track_group_id from a TrackGroupTypeBox) of the referenced track groups. The list of possible values to use for the reference type is extended with an ‘eqiv’ value as follows:

‘eqiv’: this track group contains tracks that each has an equivalent track in the referenced track group(s). It is up to a parser, depending on the track grouping type and the track properties within the track group, to determine which track from the referenced track group corresponds to a given track in this track group. For example, in the case of sub-picture tracks, tracks at the same position with the same size can be considered equivalent. As explained below by reference to FIG. 6, the ‘trgr’ boxes 601 and 602 could be associated through this ‘tref’ at track group level. As for the alternative embodiments for the declaration of equivalent track groups, the track reference type may be more precise in terms of description. Instead of defining a single ‘eqiv’ track reference type, two new track reference types may be used: one for bitstream equivalence (for example: ‘beqv’) and another one for display equivalence (for example: ‘deqv’).

FIG. 13, comprising FIGS. 13a and 13b, illustrates explicit reconstruction from alternative sets of sub-picture tracks. This invention proposes a new kind of Extractor NAL unit to use in an Extractor track like 1400 in FIG. 13a or 1450 in FIG. 13b. ISO/IEC 14496-15 defines Extractors for different compression formats: SVC, MVC, HEVC, etc. HEVC extractors introduce specific constructors to reconstruct a sample from a referenced track or from data provided within the constructor. We propose a new kind of constructor, which we can call SampleConstructorWithAlternatives, that extends the HEVC and L-HEVC extractors (or any compression format reusing the concept of constructors inside an Extractor) as follows:

class aligned(8) Extractor() {
  NALUnitHeader();
  do {
    unsigned int(8) constructor_type;
    if (constructor_type == 0)
      SampleConstructor();
    else if (constructor_type == 2)
      InlineConstructor();
    else if (constructor_type == 4)
      SampleConstructorWithAlternatives();
  } while (!EndOfNALUnit())
}

The semantics of Extractor::constructor_type is updated as follows: constructor_type specifies the constructor that follows. SampleConstructor, InlineConstructor and SampleConstructorWithAlternatives correspond to constructor_type equal to 0, 2 and 4, respectively. Other values of constructor_type are reserved.

A new section defining the new extractor is added in the Annex of ISO/IEC 14496-15 for this new constructor to become interoperable between file/segment encapsulation means 150 (e.g. mp4 writers) and file/segment de-encapsulation means 171 (e.g. mp4 readers). The new sample constructor with alternatives is defined as follows:

Syntax

class aligned(8) SampleConstructorWithAlternatives() {
  unsigned int(8) ref_index; // can be a track or a track_group index
  signed int(8) sample_offset;
  unsigned int((lengthSizeMinusOne+1)*8) data_offset;
  unsigned int((lengthSizeMinusOne+1)*8) data_length;
}

with the following semantics: ref_index specifies the index of the track reference of type ‘scal’ (or of a dedicated track reference type for bitstream equivalence like 1401 or 1451) to use to find the track or the track group containing a track from which to extract data.

- sample_offset: as specified in ISO/IEC 14496-15,
- data_offset: the offset of the first byte within the reference sample to copy. If the extraction starts with the first byte of data in that sample, the offset takes the value 0, and
- data_length: as specified in A.7.4.1.2 of DCOR3.

Such an extractor with its specific constructor can be used in the encapsulation step from FIG. 2a, step 2242, or FIG. 2b, step 242.

FIG. 14 illustrates the extractor resolution by a file/segment de-encapsulation means 171 according to the invention, for example with an ISOBMFF parser. While reconstructing samples, the de-encapsulation means reads NAL units from the media part of the file. It checks the NAL unit type in step 1500. If it corresponds to a NALU type for an Extractor (test 1501 true), it gets the ref_index parameter in 1502. When the ref_index resolves to a track_ID (test 1503 true), the ISOBMFF parser identifies in 1504 the sample referenced by the extractor, potentially considering the sample_offset given in the constructor and the sample description information. It then reads the NAL unit in 1505 and extracts the NAL unit payload in 1506, to append it in 1507 to the reconstructed bitstream resulting from the parsing and to provide it to the decoding means 172. When the ref_index resolves to a track_group_id (1503 false), it is up to the ISOBMFF parser to select the most appropriate track in the corresponding track group, as indicated by track selection in FIG. 13a. This is done in step 1508. A default behavior is to select the first track in the file having the track_group_id declared in one of its track groups. If the indication of track group equivalence contains an indication of differentiating attributes (for example reusing an attribute list like in the track selection box), this information may be used by a media player to select one track in the list of tracks pertaining to the track group with the referenced track_group_id. Once the track_group_id has been translated into a track_ID, the ISOBMFF parser follows the steps 1504 to 1508 and keeps on processing the NAL units until the end.
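The core of this FIG. 14 resolution logic can be sketched in Python as follows; the in-memory track table is a hypothetical structure and the “first track declared” default follows the behavior described above.

# Resolve a ref_index to a track_ID, falling back to track group resolution.
def resolve_ref_index(ref_id, tracks_by_id):
    if ref_id in tracks_by_id:                 # test 1503 'true': a track_ID
        return ref_id
    for track_id in sorted(tracks_by_id):      # default: first matching track
        if ref_id in tracks_by_id[track_id]["track_group_ids"]:
            return track_id
    raise ValueError("unresolvable reference %r" % ref_id)

tracks = {
    1: {"track_group_ids": [10]},
    2: {"track_group_ids": [10]},
    3: {"track_group_ids": [20]},
}
assert resolve_ref_index(1, tracks) == 1    # direct track_ID
assert resolve_ref_index(10, tracks) == 1   # first track of track group 10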

To avoid potential conflicts between track_ID and track_group_id, it is recommended that media files containing this constructor contain in their list of compatible brands a brand indicating that track_ID, track_group_ID and EntityToGroup::group_id shall be unique identifiers. Note: the above requirement on the brand is simpler than going through all the track groups to check whether the (flags & 1) is true. This new extractor could reuse the 'scal' track reference, but this would require an amendment of several parts of ISO/IEC 14496 Part-15. Having dedicated track references indicating "explicit spatial reconstruction with alternatives" ('esra', like in 1401 and 1451) instead of 'scal' would probably have the benefit of indicating the use of the specific extractor. FIG. 13b proposes a more compact solution than FIG. 13a for the description of interchangeable, mergeable or switchable sub-picture tracks, avoiding the definition of both '2dcc' and 'alte' track groups. FIG. 13b illustrates how to benefit from the track group for spatial relationship description to record track equivalence at the same time. For that, each sub-picture track indicated as pertaining to a '2dcc' track group is also indicated as pertaining to a subset. Tracks inside a same subset are then considered as alternative, interchangeable or switchable bitstreams that can be used during bitstream concatenation by ISOBMFF parsers. The subset is identified by a unique identifier that can be used as ref_index in the specific SampleConstructorWithAlternatives (track with ID #100). The subset identifier can be declared as a parameter in the '2dsr' box. The use of a subset_identifier costs 4 bytes per sub-picture track where the declaration of an 'alte' track group costs 24 bytes per sub-picture track. Moreover, this reduces the number of track groups to parse. In the embodiment of FIG. 13b, the description of the properties of the 2D spatial relationship group (the '2dsr' box) is extended to support the declaration of subsets, as follows:

aligned(8) class SpatialRelationship2DSourceBox extends FullBox('2dsr', 0, 0) {
  unsigned int(32) total_width;
  unsigned int(32) total_height;
  unsigned int(32) source_id;
  unsigned int(32) subset_id;
}

with the following semantics: subset_id is an identifier for a set of sub-picture tracks at a same spatial position that are equivalent or switchable in terms of bitstream. This means that during bitstream concatenation, the bytes for one sample of any one of the equivalent tracks in a subset may be used instead of the bytes for the same sample of any other equivalent track in the same subset. Alternatively, the subset_id may be defined in a set of parameters describing the properties of the track within the track group, for example the 'sprg' box in case of the '2dcc' track grouping_type. When using the compact description of FIG. 13b, the semantics of ref_index in SampleConstructorWithAlternatives 1452 and 1453 is changed as follows, to allow referencing subset identifiers of sub-picture subsets: ref_index specifies the index of the track reference of type 'scal' (or 'esra' like 1401 or 1451) to use to find the track, the track group, or the subset of a track group containing a track from which to extract data. When the ref_index resolves to a track_group_id or to a subset_id of a track group, it is up to the parser or player to select the most appropriate track in the corresponding track group or subset of a track group. A default behavior is to select the first track in the file having the track group id or subset_id. To avoid potential conflicts between track_ID, subset_id and track_group_id, it is recommended that media files containing this constructor contain in their list of compatible brands a brand indicating that track_ID, track_group_ID, EntityToGroup::group_id and subset_id shall be unique identifiers. The same mechanism can be extended to implicit reconstruction, i.e. when the reconstruction rule is defined at track level and no longer at sample level with extractors. A specific track reference type for "implicit reconstruction with alternatives" is defined (for example 'isra'). In case a same tile base track has alternative tile tracks for reconstruction, this specific track reference is used to associate the tile base track with the track_group_id or the subset_id describing the alternative tile tracks. Then a parser processing such a file will have an intermediate step of translating the track reference to the track_group_id or subset_id into a track_ID. It can be the selection of the first track found having the referenced track_group_id or subset_id, or a selection based on additional properties associated with alternative sub-picture tracks (like differentiating attributes directly described in the track properties within the track group, for example in the 'sprg' box).
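As an illustration, the default track selection when ref_index resolves to a subset rather than to a track could look as follows. This is a minimal sketch; the subset_id attribute is a hypothetical stand-in for the value parsed from the extended '2dsr' box above:

def select_track_in_subset(all_tracks, subset_id):
    """Default behavior: first track whose '2dsr' box declares subset_id.

    Tracks sharing a subset_id are bitstream-equivalent at the same spatial
    position, so any of them is a valid source for extraction; a smarter
    player may instead rank them on differentiating attributes.
    """
    for track in all_tracks:
        if getattr(track, 'subset_id', None) == subset_id:
            return track
    raise ValueError('no track declares subset_id %d' % subset_id)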

FIG. 6 illustrates a second example of use of the SpatialRelationship2DDescriptionBox and the indication of group equivalence 603 according to embodiments of the invention. The same video source 600 (e.g. the same projected video source) is used to generate two alternative versions in terms of quality (@quality1 and @quality2). There are two sets of sub-picture tracks: one for the high quality (quality 1) 610 and one for the low quality (quality 2) 620.

The corresponding sub-picture tracks can be described as on the right part of FIG. 6 (in the 'trak' box hierarchies 611 and 621). Both track groups have the same source_id, and the same total_width and total_height corresponding to the resolution of each set of sub-picture tracks. The sub-picture track coordinates (object_x, object_y, object_width, object_height) describe the sub-picture track's spatial relationship or position within its respective track group composition. Again, as both track groups have the same source_id, they represent the same source, and sub-picture tracks from the first track group 601 (with track_group_id equal to 10) can be combined with sub-picture tracks from the same track group but also with sub-picture tracks from the second track group 602 (with track_group_id equal to 20), with respect to their respective positions in their respective compositions.

According to this example, the composition picture represented by the track group 601 with track_group_id equal to 10 can be composed by selecting one sub-picture from the alternate group 602, as indicated by the dedicated track reference 603.

Contrary to two-dimensional (2D) video content, OMAF media content represents an omnidirectional media content illustrating the user's viewing perspective from the centre of a sphere looking outward towards the inside surface of the sphere. This 360° media content is then projected onto a two-dimensional plane by applying a video projection format. Then, optionally, region-wise packing is applied to reorganize regions from the projected picture into packed regions. A 360° media content may also be represented by several circular images captured with a fisheye lens (wide-angle camera lens).

Thus, in the context of OMAF, a 2D picture (resulting from the reconstruction of sub-picture tracks) may be either a projected picture or a packed picture, and sub-picture tracks may contain different kinds of content:

- sub parts of a projected picture (with no packing),
- sub parts of a frame-packed picture, for example when the content is stereoscopic,
- sub parts of a projected and packed picture, or
- sub parts of a fisheye coded picture.

According to a third aspect of the invention, the definition of SpatialRelationship2DDescriptionBox is improved to indicate whether the size and position coordinates of sub-picture tracks containing OMAF media content are relative to the projected picture, to the packed picture, or to another picture. The third aspect may be combined with both the first and second aspects.

In one embodiment, SpatialRelationship2DDescriptionBox is defined so that the size and position coordinates of sub-picture tracks containing OMAF media content are always relative to the packed picture. When there is no packing, the packed picture is equal to the projected picture.

In another embodiment, SpatialRelationship2DDescriptionBox is defined so that the size and position coordinates of sub-picture tracks containing OMAF media content are relative to the projected picture, to the packed picture, or to any intermediate picture in the processing steps between the capturing step 110 and the encoding step 140. In particular, in the case of the application format for omnidirectional media (OMAF), it is not clear whether positions and sizes expressed in the 2D spatial relationships refer to the projected or to the packed picture.

In one embodiment, the SpatialRelationship2DDescriptionBox is always relative to the packed picture. When there is no packing, the packed picture is the same as the projected picture.

In another embodiment, a preferred approach is to define that SpatialRelationship2DDescriptionBox is always relative to the projected picture.

The method for encapsulating encoded media data corresponding to a wide view of a scene may comprise, in some embodiments, the following steps:

- obtaining a projected picture from the wide view of the scene;
- packing the obtained projected picture in at least one packed picture;
- splitting the at least one packed picture into at least one sub-picture;
- encoding the at least one sub-picture into a plurality of tracks;
- generating descriptive metadata associated with the encoded tracks, wherein the descriptive metadata comprise an item of information associated with each track being indicative of a spatial relationship between the at least one sub-picture encoded in the track and the at least one projected picture.

Accordingly, no particular signalling of the reference picture is needed. The reference picture is defined to be the projected picture even if the sub-pictures are obtained by splitting the packed picture.

The method for encapsulating encoded media data corresponding to a wide view of a scene may comprise, in some embodiments, the following steps:

- obtaining a projected picture from the wide view of the scene;
- splitting the projected picture into at least one sub-picture;
- encoding the at least one sub-picture into a plurality of tracks;
- generating descriptive metadata associated with the encoded tracks, the descriptive metadata comprising a first item of information associated with each track being indicative of a spatial relationship between the at least one sub-picture encoded in the track and a reference picture;

wherein the descriptive metadata further comprise a second item of information indicating the reference picture.

Accordingly, by specifying the reference picture in the metadata, it is possible to generate sub-picture data related to any of the projected picture, the packed picture or any other reference picture, independently of the splitting operation.

The table below proposes a practical mapping of the SpatialRelationship2DDescriptionBox track group sizes and coordinates attributes, relative to the projected picture, in the context of OMAF, for sub-picture tracks containing either projected (for example using equirectangular (ERP) or cubemap projections), packed or fisheye contents. In the table below, 'rwpk' is a shortcut for the region-wise packing structure, i.e. a structure that specifies the mapping between packed regions and the respective projected regions and specifies the location and size of the guard bands, if any. Similarly, 'fovi' is a shortcut for the FisheyeVideoEssentialInfoStruct, a structure that describes parameters for enabling stitching and rendering of fisheye images at the OMAF player.

Projected picture (no packing):
- total_width/total_height: shall be equal to the luma picture size of the projected picture.
- object_width/object_height: shall be equal to the width and height of the projected region represented by the sub-picture track's samples (in such a case, they shall be equal to the width and height declared in the track header box of the sub-picture track).
- object_x/object_y: shall be equal to the x, y coordinates of the top-left corner of the projected region represented by the sub-picture track's samples within the projected picture.

Projected and packed picture:
- total_width/total_height: shall be equal to the luma picture size of the projected picture; total_width = rwpk@proj_picture_width, total_height = rwpk@proj_picture_height.
- object_width/object_height: shall be equal to the width and height of the projected region represented by unpacking the sub-picture track's samples (in such a case, the projected region resulting from the unpacking of the sub-picture track may contain gaps).
- object_x/object_y: shall be equal to the x, y coordinates of the top-left corner of the projected region represented by unpacking the sub-picture track's samples within the projected picture.

Fisheye projected picture:
- total_width/total_height: shall be equal to the luma picture size of the projected image, i.e. the image including all circular images; total_width = fovi@rect_region_left + fovi@rect_region_width of the last circular image, total_height = fovi@rect_region_top + fovi@rect_region_height of the last circular image.
- object_width/object_height: shall be equal to the width and height of the rectangular projected region that contains the one or more circular images from the sub-picture track. E.g. in case a sub-picture track contains only one circular image: object_width = fovi@rect_region_width, object_height = fovi@rect_region_height. In case the sub-picture track contains more than one circular image, the object_width (resp. object_height) is equal to the sum of the fovi@rect_region_width (resp. the sum of the fovi@rect_region_height) of the contained circular images.
- object_x/object_y: shall be equal to the coordinates of the top-left corner of the rectangular projected region that contains the one or more circular images from the sub-picture track. E.g. in case of only one circular image: object_x = fovi@rect_region_left, object_y = fovi@rect_region_top. In case the sub-picture track contains more than one circular image, the object_x (resp. object_y) is equal to the fovi@rect_region_left (resp. the fovi@rect_region_top) of the most top-left circular image in the list of contained circular images.
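The arithmetic of the fisheye case can be illustrated as follows. This is a sketch only; the tuples stand in for the fovi@rect_region_left/top/width/height fields of each circular image, which a real implementation would read from the FisheyeVideoEssentialInfoStruct:

def fisheye_geometry(images_in_track, last_image_of_source):
    """Each image is a (left, top, width, height) tuple from fovi@rect_region_*."""
    # total_width/total_height: derived from the last circular image of the source
    left, top, width, height = last_image_of_source
    total_width = left + width
    total_height = top + height
    # object_width/object_height: sum over the circular images in the track
    object_width = sum(w for (_, _, w, _) in images_in_track)
    object_height = sum(h for (_, _, _, h) in images_in_track)
    # object_x/object_y: corner of the most top-left circular image
    # (approximated here by a lexicographic minimum)
    object_x, object_y = min((l, t) for (l, t, _, _) in images_in_track)
    return total_width, total_height, object_width, object_height, object_x, object_y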

Defining SpatialRelationship2DDescriptionBox attributes as relative to the projected picture provides an advantage to the application compared to defining them as relative to the packed picture. Indeed, in case of viewport-dependent streaming, the application may only want to download sub-picture tracks corresponding to the current user's viewport (i.e. corresponding to the user's field of view and orientation). If the SpatialRelationship2DDescriptionBox attributes are defined as relative to the projected picture, the application can directly use this information from the SpatialRelationship2DDescriptionBox track group to select appropriate sub-picture tracks while it is moving inside the projected picture. Otherwise, the application needs to parse, in addition to the track group information, the region-wise packing information located in the VisualSampleEntry to convert sub-picture packed content into the projected picture before being able to select appropriate sub-picture tracks.

Optionally, the track group describing spatial relationships (e.g. the '2dcc' track group) may contain an additional descriptor providing, for a given sub-picture track, its mapping to the 360° sphere. This additional descriptor provides, without any computation for the media player, the mapping between the 2D video sub-picture track and a 3D viewport, so that selection by the player of the relevant track or set of tracks corresponding to a given user's viewing direction is easier. The track group describing the spatial relationships is then rewritten as follows:

aligned(8) class SpatialRelationship2DDescriptionBox extends TrackGroupTypeBox('2dcc') {
  // track_group_id is inherited from TrackGroupTypeBox;
  SpatialRelationship2DSourceBox(); // mandatory, must be first
  SubPictureRegionBox();            // optional
  SphericalRegionBox();             // optional
}

where the SpatialRelationship2DSourceBox and SubPictureRegionBox respectively describe the 2D coordinate system of the sub-picture tracks pertaining to the track group and their positions and sizes;

and where SphericalRegionBox is a new box defined as follows (the four-character code is just an example, any four-character code may be used, provided it is reserved for the indication of a spherical region):

aligned(8) class SphericalRegionBox extends FullBox('sspr', 0, 0) {
  SphereRegionStruct(1);
}

where the SphereRegionStruct specifies a sphere region as a triplet (centre_azimuth, centre_elevation, centre_pitch), or sometimes (yaw, pitch, roll), with ranges for the azimuth (horizontal) and elevation (vertical) dimensions.

FIG. 7 illustrates the sub-picture encapsulation performed by the means 250 of FIG. 1a and the optional means 260, 280 and 285. In step 701, the user configures the encapsulation module (for example an ISOBMFF writer or mp4 packager in means 150 on FIG. 1a). This can be done through a graphical user interface controlling an encapsulation software. It consists in specifying information on the source to encapsulate or parameters for the encapsulation, like the decomposition into sub-picture tracks for example, or the generation of one single media file or many segment files. Alternatively, this can be pre-registered as settings in the recording device capturing the scene (camera, network camera, smartphone . . . ). Then, the encapsulation module initializes the reference picture in step 702 as the captured image. This consists in storing in the RAM of the device running the encapsulation module the sizes of the captured image. Next, at step 703, the encapsulation module checks whether the encapsulation configuration contains a projection step. If false, the next step is 706. For example, when the captured content is 360° content, it can be projected onto a 2D image, called the projected picture. If a projection is in use (test 703 true), then the encapsulation module inserts (step 704) the description of the projection in use in the descriptive metadata of the media file (or media segments). This can be for example a Projected omnidirectional video box 'povd' according to the OMAF specification. Then (step 705), the reference picture is set to the projected picture. This means for example that the sizes of this projected picture are stored in memory. Step 706 consists in checking whether the captured source is stereoscopic or not and whether the views are packed into a single frame. If the test 706 is true, then the encapsulation module inserts (step 707) in the media file a descriptor for stereo content. In case of OMAF or ISOBMFF it is a StereoVideoBox. If the test 706 is false, the next step is 709. Following step 707, the frame-packed picture is stored in memory as the reference picture (step 708). The test 709 consists in checking whether the encapsulation configuration indicates that the projected and optionally frame-packed picture needs to be further rearranged into packed regions. If test 709 is true, the encapsulation module inserts (step 710) the description of this packing into regions (equivalent to the optional step 260 of FIG. 1). In the case of OMAF, it can be a RegionWisePackingBox identified by the 'rwpk' box type. Then, in 711, the reference picture is set to the packed picture. If test 709 is false, the next step is 712. The test in step 712 consists in checking the encapsulation configuration: whether implicit signaling or explicit signaling for sub-picture tracks is chosen or set by the user or the application. If the implicit signaling is off, then at step 713 the encapsulation module inserts descriptive metadata indicating which reference picture is used for sub-picture track generation (i.e. the picture that has been split into spatial parts, each encapsulated in sub-picture tracks). If the implicit signaling is on, then the next step is 714. At step 714, the encapsulation module inserts a track group describing the spatial relationships among the different spatial parts of the split picture. In particular, the size of the resulting composition of the sub-picture tracks is set to the size of the reference picture stored in memory (in 702, 705, 708 or 711). This can be for example the total_width and total_height parameters in the SpatialRelationship2DSourceBox. Finally, at step 715, the encapsulation module describes each sub-picture track in terms of positions and sizes in the reference picture. This consists, for example in OMAF or ISOBMFF, in putting the values resulting from the split into the parameters of the SubPictureRegionBox when these parameters are static, or into the sample group description box for spatial relationship description (for example the SpatialRelationship2DGroupEntry box).
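The flow of FIG. 7 may be sketched as follows. This is an illustrative Python pseudo-implementation; the configuration fields and writer methods are invented names standing in for an encapsulation module's internals, not APIs from OMAF or ISOBMFF:

def encapsulate(cfg, writer):
    """Sketch of FIG. 7, steps 701-715."""
    reference_size = cfg.captured_size               # step 702
    if cfg.has_projection:                           # test 703
        writer.insert_box('povd')                    # step 704: projection descriptor
        reference_size = cfg.projected_size          # step 705
    if cfg.is_frame_packed_stereo:                   # test 706
        writer.insert_box('stvi')                    # step 707: StereoVideoBox
        reference_size = cfg.frame_packed_size       # step 708
    if cfg.has_region_wise_packing:                  # test 709
        writer.insert_box('rwpk')                    # step 710: RegionWisePackingBox
        reference_size = cfg.packed_size             # step 711
    if not cfg.implicit_signaling:                   # test 712
        writer.signal_reference_picture(reference_size)  # step 713
    # Step 714: track group whose composition size is the reference picture size
    writer.insert_track_group('2dcc', total_width=reference_size[0],
                              total_height=reference_size[1])
    for part in cfg.spatial_parts:                   # step 715
        writer.describe_subpicture(part.x, part.y, part.width, part.height)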

The explicit signaling of step 713 can be done in various ways, as described along with the description of the parsing process illustrated by FIG. 8.

The method for generating at least one image from a media file comprising a plurality of encoded tracks and associated descriptive metadata may comprise, in some embodiments:

- determining that the plurality of encoded tracks comprises a group of tracks encoding at least one sub-picture resulting from the splitting of a packed picture obtained by packing a projected picture of a wide view of a scene;
- parsing descriptive metadata associated with the group of tracks;

wherein parsing descriptive metadata associated with the group of tracks comprises:

- interpreting an item of information associated with each track being indicative of a spatial relationship between the at least one sub-picture encoded in the track and the at least one projected picture.

The method for generating at least one image from a media file comprising a plurality of encoded tracks and associated descriptive metadata may comprise, in some embodiments:

- determining that the plurality of encoded tracks comprises a group of tracks encoding at least one sub-picture resulting from the splitting of a projected picture of a wide view of a scene;
- parsing descriptive metadata associated with the group of tracks;

wherein parsing descriptive metadata associated with the group of tracks comprises:

- interpreting a first item of information associated with each track being indicative of a spatial relationship between the at least one sub-picture encoded in the track and the at least one reference picture; and
- interpreting a second item of information indicating the reference picture.

The media player, using an ISOBMFF parser, receives the OMAF file in 801. It identifies the different tracks present in the media file and in particular the video tracks. For those video tracks, the parser checks whether they are classical 2D videos or video tracks for omnidirectional media that have been projected onto a 2D picture. This is determined by looking at the major brand or at the list of compatible brands in the 'ftyp' box in step 802. For example, a brand set to 'ovdp' indicates that the media file contains a VR experience using the technologies for the OMAF viewport-dependent baseline presentation profile. This invention proposes, in an embodiment, to define an explicit brand (as a major_brand value or to be put in the list of compatible brands) indicating that the VR experience according to an OMAF viewport-dependent profile further uses sub-picture tracks. At least two specific values for brands (major or compatible) may be defined:

A first value may be defined, for example named 'odpr', for omnidirectional dependent profile. This value indicates that the omnidirectional media is split into sub-picture tracks referencing the projected picture. Any ISOBMFF parser or OMAF player compliant with this brand shall interpret sub-picture track positions as positions in the projected picture. As well, the total_width and total_height shall be respectively interpreted as the width and height of the projected picture.

A second value may be defined, for example named 'odpa', for omnidirectional dependent profile. This value indicates that the omnidirectional media is split into sub-picture tracks referencing the packed picture. Any ISOBMFF parser or OMAF player compliant with this brand shall interpret sub-picture track positions as positions in the packed picture. As well, the total_width and total_height shall be respectively interpreted as the width and height of the packed picture.
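A brand check along these lines could be sketched as follows. This is illustrative only; 'odpr' and 'odpa' are the example brand names proposed above, and the accessor names are assumptions about the parsed 'ftyp' box:

def reference_picture_from_brands(major_brand, compatible_brands):
    """Step 802 shortcut: deduce the reference picture from 'ftyp' brands."""
    brands = {major_brand, *compatible_brands}
    if 'odpr' in brands:
        return 'projected'   # positions and sizes refer to the projected picture
    if 'odpa' in brands:
        return 'packed'      # positions and sizes refer to the packed picture
    return None              # no such brand: use explicit or implicit signaling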

When one of these brands is present, the OMAF player or media player immediately identifies how to get the reference picture information. It then parses the explicit track group for spatial relationship description that contains an indication of the reference picture. This is done at step 803.

When none of these brands is present in the 'ftyp' box, the media file parser or media player has to further parse the media file to determine the presence of sub-picture tracks and whether they reference the projected or the packed picture (object of test 802). If the track groups describing spatial relationships are explicit track groups according to embodiments of this invention, then the parser parses in 803 these explicit track groups. It determines at step 804 the reference picture in use to describe the sub-picture tracks in a given track group (identified through the track_group_id for example). This has to be taken into account when presenting sub-picture tracks to the user for selection or when rendering the sub-picture tracks. Additional transformation may be required to generate the image, from the sub-picture track expressed in the reference picture to the captured picture. For example, when the reference picture is the packed picture, the sub-picture track positions and sizes have to be unpacked to be expressed in the projected picture. This processing is the object of step 812. We now explain how explicit signaling is performed during the encapsulation step 713 to be used by the parser in step 803.

In alternative embodiments to the new brands, it is proposed to add explicit signaling at the track or track group level. This may be done using the '2dcc' track group for 2D spatial relationship description in ISOBMFF. This additional signaling can help parsers or players to handle sub-picture tracks, in particular to determine whether they express positions and sizes for the projected picture or for the packed picture.

One embodiment for such signaling may be to define a new parameter in the specific track group type box for the spatial relationship description. Preferably it is defined in the mandatory part of the track group box, namely the SpatialRelationship2DSourceBox for spatial relationship description, so that a parser can obtain the information.

An example of this embodiment may be:

aligned(8) class SpatialRelationship2DDescriptionBox extends TrackGroupTypeBox('2dcc') {
  // track_group_id is inherited from TrackGroupTypeBox;
  SpatialRelationship2DSourceBox(); // mandatory, must be first
  SubPictureRegionBox();            // optional
}

aligned(8) class SpatialRelationship2DSourceBox extends FullBox('2dsr', 0, 0) {
  unsigned int(32) total_width;
  unsigned int(32) total_height;
  unsigned int(32) source_id;
  unsigned int(1)  reference_picture;
  unsigned int(31) reserved;
}

where reference_picture is a new parameter that, when taking the value "0", indicates that the positions for the sub-picture tracks in this group are expressed in the projected picture coordinate system. When taking the value "1", it indicates that the sub-picture tracks in this group are expressed in the packed picture. The name given to this parameter is an example. As well, the total_width and total_height respectively indicate the width and the height of the projected picture.
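A parser reading this extended '2dsr' payload could proceed as in the sketch below, assuming the FullBox version/flags header has already been consumed; the bit layout follows the definition above, with reference_picture as the most significant bit of the final 32-bit word (the 31 remaining bits are reserved):

import struct

def parse_2dsr_payload(payload):
    """Parse total_width, total_height, source_id and reference_picture."""
    total_width, total_height, source_id, last_word = struct.unpack('>IIII', payload[:16])
    reference_picture = last_word >> 31  # 0: projected picture, 1: packed picture
    return total_width, total_height, source_id, reference_picture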

To be more generic than simply supporting a choice of reference picture between the projected or the packed picture, the reference_picture parameter may take several values, the value corresponding to the intermediate picture to use as reference between the capture and the encoding. For example, value 0 may be used for the captured image (step 702) when there is no projection, value 1 may be used when there is projection only (step 705), value 2 for the frame-packed picture (step 708) and value 3 for the packed picture (step 711). This indication would require 2 bits, compared to the previous embodiment supporting only the projected and packed pictures.

Another embodiment, with more explicit signaling, consists in providing a 4cc code to describe the reference picture (instead of an integer value). This would be more costly in terms of description (4 bytes per sub-picture track). For example, to indicate that the reference picture is the projected picture, the reference picture value could be set to 'povd'. For the packed picture, it could be set to 'rwpk'; for the frame-packed picture, it could be 'stvi'. For the captured image, the default case could be set to a dedicated four-character code: 'dflt' for "default", meaning the captured image. Preferably, a mapping between an intermediate picture and an integer code is defined and registered, for example by the mp4 registration authority, to have interoperable codes for the reference picture value.

The additional reference_picture parameter may alternatively be declared in the optional part of the SpatialRelationship2DDescriptionBox, namely the SubPictureRegionBox. It may be preferable to have it in the mandatory part when explicit signaling is decided in step 712. This is to make sure that the parser or player can find the information.

In another alternative embodiment, additional signaling in the specific track group type box for the spatial relationship description is defined in a way that preserves backward compatibility with older versions of spatial relationship description in ISOBMFF or OMAF. For that, a new version of the TrackGroupTypeBox is defined, for example version=1, or the same version=0 but with a flags value. It is to be noted that the TrackGroupTypeBox in the prior art does not allow a flags value. Providing the TrackGroupTypeBox with a flags value is part of this embodiment of the invention.

A flag value "Reference_info_is_present", set for example to the value 0x01, may be defined to indicate that this track group contains information on a reference picture to consider for the positions and sizes of the spatial relationship information. Then the '2dcc' track group can be expressed as follows:

aligned(8) class SpatialRelationship2DDescriptionBox extends TrackGroupTypeBox('2dcc', 0, flags) {
  // track_group_id is inherited from TrackGroupTypeBox;
  SpatialRelationship2DSourceBox(flags); // mandatory, must be first
  SubPictureRegionBox();                 // optional
}

aligned(8) class SpatialRelationship2DSourceBox extends FullBox('2dsr', 0, flags) {
  unsigned int(32) total_width;
  unsigned int(32) total_height;
  unsigned int(32) source_id;
  if ((flags & 0x01) == 1) {
    unsigned int(1)  reference_picture;
    unsigned int(31) reserved;
  }
}

where reference_picture is a new parameter that, when taking the value "0", indicates that the positions for the sub-picture tracks in this group are expressed in the projected picture coordinate system. The name of the parameter is given as an example. As well, the total_width and total_height respectively indicate the width and the height of the projected picture.

Using the flags reduces the description cost of each sub-picture track when there is no ambiguity on the reference picture, for example for a classical 2D video. Using the flags to indicate the presence or absence of a reference picture allows reusing the '2dcc' track grouping type to handle both cases of splitting an omnidirectional content into sub-picture tracks: with or without the region-wise packing step.

In yet another embodiment, the flags parameter of the TrackGroupTypeBox, or of one of its inheriting boxes like SpatialRelationship2DDescriptionBox, is used to provide the reference picture directly in the flags value. For example, when the flags parameter has its least significant bit set to 0, this means that the reference picture is the projected picture in case of omnidirectional video. When the flags parameter has its least significant bit set to 1, then it means that the reference picture is the packed picture in case of omnidirectional video. The default value is the least significant bit of the flags parameter set to 0. With this embodiment, there is no additional parameter in the SpatialRelationship2DSourceBox, which makes the file description more compact (saving 4 bytes per sub-picture track).
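With this flags-based variant, no payload parsing is needed at all; a minimal sketch of the test, assuming the 24-bit flags of the track group box have already been read:

def reference_picture_from_flags(flags):
    """Least significant bit of the track group flags selects the picture."""
    return 'packed' if (flags & 0x01) else 'projected'  # default (0): projected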

In an alternative embodiment, the distinction between implicit and explicit sub-picture track signaling is done by using two different track grouping types. The current grouping type is used for implicit signaling, and a new track grouping type is defined for the explicit spatial relationship track group. For example, the four-character code 'edcc' is used and a new TrackGroupTypeBox is created as follows:

aligned(8) class ExplicitSpatialRelationship2DDescriptionBox extends TrackGroupTypeBox('edcc', 0, flags) {
  // track_group_id is inherited from TrackGroupTypeBox;
  ExplicitSpatialRelationship2DSourceBox(flags); // mandatory, must be first
  SubPictureRegionBox();                         // optional
}

aligned(8) class ExplicitSpatialRelationship2DSourceBox extends FullBox('edsr', 0, flags) {
  unsigned int(32) total_width;
  unsigned int(32) total_height;
  unsigned int(32) source_id;
  unsigned int(8) reference_picture;
}

When the encapsulation configuration is determined to be "implicit" (tests 801 and 802 false), meaning that no specific signaling is used, the parser goes into implicit determination of the reference picture. This consists in determining, by parsing the schemes declared in the restricted scheme information box 'rinf', which transformation or post-decoding operations have to be performed and which picture they potentially provide as the reference picture. Most of the time for OMAF, it can be the packed picture or the projected picture. For stereoscopic content, it may also be the frame-packed picture. The parser then checks the presence of OMAF descriptors to determine the candidate reference pictures. The parser assumes that the positions and sizes parameters for the spatial relationship description are expressed with respect to the projected picture when there is no region-wise packing indication in the media file (test 810 false). When a region-wise packing box is present, the positions and sizes parameters for the spatial relationship description are expressed with respect to the packed picture (step 811). Optionally, the parser may consider the presence or absence of the frame-packed picture by testing for the presence of a 'stvi' box in the sub-picture tracks of the track group describing the spatial relationship (step 808). If present, the parser records the frame-packed picture as a candidate reference picture. More generally, for the implicit signaling, the positions and sizes of the sub-picture tracks are considered to be expressed in the last picture resulting from the different processing steps between the capture 110 and the encoding 140. These different processing steps are reflected in the restricted scheme information box 'rinf'. For example, when the content preparation contains projection 120, frame packing 125 and region-wise packing 130, the RestrictedSchemeInfoBox 'rinf' contains in its SchemeTypeBox a 'povd' box indicating that a projection has been applied. This 'povd' box may itself contain a structure describing the region-wise packing done at 130, for example as a RegionWisePackingBox 'rwpk'. As well, a stereo video box is present, for example in a CompatibleSchemeTypeBox, to indicate the frame packing implemented by means 125.
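The implicit determination can be sketched as follows. This is a simplified illustration; has_box is an assumed accessor over the 'rinf' box hierarchy, and the return values simply name the candidate reference pictures:

def implicit_reference_picture(rinf):
    """Keep the last picture produced before encoding (capture 110 to encoding 140)."""
    reference = 'captured'
    if rinf.has_box('povd'):   # a projection has been applied
        reference = 'projected'
    if rinf.has_box('stvi'):   # frame packing applied (step 808)
        reference = 'frame-packed'
    if rinf.has_box('rwpk'):   # region-wise packing applied (tests 810/811)
        reference = 'packed'
    return reference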

For an optimized implicit mode, and in closed systems, the encapsulation and the parser may exchange configuration information or define settings to declare a pre-defined default mode for sub-picture track description. For example, they may agree that sub-picture tracks always reference the projected image when the media contains omnidirectional content.

FIG. 9 illustrates a system 991, 995 comprising at least one of an encoder 950 or a decoder 900 and a communication network 999 according to embodiments of the present invention. According to an embodiment, the system 995 is for processing and providing a content (for example, a video and audio content for displaying/outputting or streaming video/audio content) to a user, who has access to the decoder 900, for example through a user interface of a user terminal comprising the decoder 900 or a user terminal that is communicable with the decoder 900. Such a user terminal may be a computer, a mobile phone, a tablet or any other type of device capable of providing/displaying the (provided/streamed) content to the user. The system 995 obtains/receives a bitstream 901 (in the form of a continuous stream or a signal, e.g. while earlier video/audio are being displayed/output) via the communication network 999. According to an embodiment, the system 991 is for processing a content and storing the processed content, for example a video and audio content processed for displaying/outputting/streaming at a later time. The system 991 obtains/receives a content comprising an original sequence of images 951, for example corresponding to a wide view scene in embodiments of the invention, which is received and processed by the encoder 950, and the encoder 950 generates a bitstream 901 that is to be communicated to the decoder 900 via the communication network 999. The bitstream 901 is then communicated to the decoder 900 in a number of ways; for example, it may be generated in advance by the encoder 950 and stored as data in a storage apparatus in the communication network 999 (e.g. on a server or a cloud storage) until a user requests the content (i.e. the bitstream data) from the storage apparatus, at which point the data is communicated/streamed to the decoder 900 from the storage apparatus. The system 991 may also comprise a content providing apparatus for providing/streaming, to the user (e.g. by communicating data for a user interface to be displayed on a user terminal), content information for the content stored in the storage apparatus (e.g. the title of the content and other meta/storage location data for identifying, selecting and requesting the content), and for receiving and processing a user request for a content so that the requested content can be delivered/streamed from the storage apparatus to the user terminal. Advantageously, in embodiments of the invention, the user terminal is a head-mounted display. Alternatively, the encoder 950 generates the bitstream 901 and communicates/streams it directly to the decoder 900 as and when the user requests the content. The decoder 900 then receives the bitstream 901 (or a signal) and performs the decoding of the sub-picture tracks according to the invention to obtain/generate a video signal 909 and/or audio signal, which is then used by a user terminal to provide the requested content to the user.

FIG. 3 is a schematic block diagram of a computing device 300 for implementation of one or more embodiments of the invention. The computing device 300 may be a device such as a micro-computer, a workstation or a light portable device. The computing device 300 comprises a communication bus connected to:

- a central processing unit (CPU) 301, such as a microprocessor;
- a random access memory (RAM) 302 for storing the executable code of the method of embodiments of the invention as well as the registers adapted to record variables and parameters necessary for implementing the method for reading and writing the manifests and/or for encoding the video and/or for reading or generating data under a given file format, the memory capacity thereof being expandable by an optional RAM connected to an expansion port for example;
- a read-only memory (ROM) 303 for storing computer programs for implementing embodiments of the invention;
- a network interface 304 that is, in turn, typically connected to a communication network over which digital data to be processed are transmitted or received. The network interface 304 can be a single network interface, or composed of a set of different network interfaces (for instance wired and wireless interfaces, or different kinds of wired or wireless interfaces). Data are written to the network interface for transmission or are read from the network interface for reception under the control of the software application running in the CPU 301;
- a user interface (UI) 305 for receiving inputs from a user or displaying information to a user;
- a hard disk (HD) 306;
- an I/O module 307 for receiving/sending data from/to external devices such as a video source or display.

The executable code may be stored either in the read-only memory 303, on the hard disk 306 or on a removable digital medium such as for example a disk. According to a variant, the executable code of the programs can be received by means of a communication network, via the network interface 304, in order to be stored in one of the storage means of the communication device 300, such as the hard disk 306, before being executed.

The central processing unit 301 is adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to embodiments of the invention, which instructions are stored in one of the aforementioned storage means. After powering on, the CPU 301 is capable of executing instructions from the main RAM memory 302 relating to a software application after those instructions have been loaded from the program ROM 303 or the hard disk (HD) 306 for example. Such a software application, when executed by the CPU 301, causes the steps of the flowcharts shown in the previous figures to be performed.

In this embodiment, the apparatus is a programmable apparatus which uses software to implement the invention. However, alternatively, the present invention may be implemented in hardware (for example, in the form of an Application Specific Integrated Circuit or ASIC).

Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications which lie within the scope of the present invention will be apparent to a person skilled in the art.

For example, the present invention may be embedded in a device like a camera, a smartphone, a head-mounted display or a tablet that acts as a remote controller for a TV or for a multimedia display, for example to zoom in on a particular region of interest. It can also be used from the same devices to have a personalized browsing experience of a multimedia presentation by selecting specific areas of interest. Another usage of these devices and methods by a user is to share with other connected devices some selected sub-parts of his preferred videos. It can also be used with a smartphone or tablet to monitor what happens in a specific area of a building put under surveillance, provided that the surveillance camera supports the method for providing data according to the invention.

Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that scope being determined solely by the appended claims. In particular, the different features from different embodiments may be interchanged, where appropriate.

1. A method for encapsulating encoded timed media data into a plurality of tracks forming one same group of tracks, said media data corresponding to one or more video sequences, the method comprising, for the plurality of tracks forming the group of tracks: providing descriptive information about the spatial relationship of spatial parts of frames encapsulated in the plurality of tracks, wherein said descriptive information indicates whether the region, covered by the spatial parts of frames encapsulated in the plurality of tracks, forms an entire sphere or not.
 2. The method of claim 1, wherein said descriptive information is provided in a same data structure comprising descriptive information shared by the plurality of tracks.
 3. The method of claim 2, wherein the data structure is a TrackGroupTypeBox.
 4. The method of claim 1, wherein said descriptive information comprises coverage information of the region covered by the spatial parts of frames encapsulated in the plurality of tracks, when the region covered by the spatial parts is not the entire sphere.
5. The method of claim 4, wherein said descriptive information does not comprise the coverage information of the region covered by the spatial parts of frames encapsulated in the plurality of tracks, when the region covered by the spatial parts is the entire sphere.
 6. A method for generating a media file comprising capturing one or more video sequences, encoding media data corresponding to the one or more video sequences, encapsulating the encoded media data into a plurality of tracks forming one same group of tracks according to the encapsulating method of claim 1, and generating at least one media file comprising said plurality of tracks.
7. A method for obtaining at least one frame from a media file comprising encoded timed media data encapsulated into a plurality of tracks forming one same group of tracks, said media data corresponding to one or more video sequences, the method comprising: parsing information associated with the plurality of tracks forming the group of tracks, wherein the parsed information comprises descriptive information about the spatial relationship of spatial parts of frames encapsulated in the plurality of tracks, said descriptive information indicating whether the region covered by the spatial parts encapsulated in the plurality of tracks forms an entire sphere or not.
 8. A computing device for encapsulating encoded timed media data into a plurality of tracks forming one same group of tracks, said media data corresponding to one or more video sequences, the computing device being configured for the plurality of tracks forming the group of tracks for: providing descriptive information about the spatial relationship of spatial parts of frames encapsulated in the plurality of tracks, wherein said descriptive information indicates whether the region, covered by the spatial parts of frames encapsulated in the plurality of tracks, forms an entire sphere or not.
9. A computing device for obtaining at least one frame from a media file comprising encoded timed media data encapsulated into a plurality of tracks forming one same group of tracks, said media data corresponding to one or more video sequences, the computing device being configured for: parsing information associated with the plurality of tracks forming the group of tracks, wherein the parsed information comprises descriptive information about the spatial relationship of spatial parts of frames encapsulated in the plurality of tracks, said descriptive information indicating whether the region covered by the spatial parts of frames encapsulated in the plurality of tracks forms an entire sphere or not.
10. A computer-readable storage medium storing instructions of a computer program for implementing a method according to claim 1.